This report summarizes the activities supported by DOD contract
N00014-81-D-2151 (ARPA order number 4095) and the results of those activities.
The period of this contract was originally scheduled to run from 80 Dec 19
through 81 Dec 19, but it was extended by a no-cost extension until 82 Jan 29.
It was succeeded by DOD contract N00014-82-C-2087 (ARPA order number 4095),
which supports a continuation of research activities in the same area. We view
the former contract as supporting a pilot project, initiating a research program
that is still under way. For this reason, the following pages will have more of
the flavor of an interim report than a final summation.

A variety of research activities are supported by this project, all sharing the
common theme of hardware and software architectures for distributed computing.
A related DARPA-sponsored project at another institution is developing a
very high-bandwidth communications medium (in the multi-gigabit/sec range)
using LSI laser technology and fiber optics. Such a communications medium
suggests the possibility of attacking computationally intensive tasks by
connecting a large number (at least 10^3) of processing elements in an MIMD
multicomputer. To exploit the fruits of such an organization, however, many
difficult questions of high-quality system architecture, algorithmic design,
and software engineering must be solved.

Our research may be grouped into two broad activities. First, we are carrying
out a continuing project to construct and experiment with a prototype
operating system for such a multicomputer. Second, we have pursued several
theoretical investigations concerning system architecture and algorithmic
design, testing our conclusions with a combination of mathematical analysis,
simulation, and experimental implementation. We will describe first the
theoretical studies and then the operating systems implementation activities.

2.  THEORETICAL  STUDIES
2.1.  Process  Migration 

Two  essential  problems  must  be  solved  before  process  migration  for  load 
balancing  becomes  successful.  The  first  problem  is  one  of  policy:  When  should  a 
process  be  migrated?  The  second  is  one  of  implementation:  How  is  a  process 
migrated?  Professor  Finkel,  in  collaboration  with  Professor  Ray  Bryant,  has 
investigated  the  first  issue  [1].  They  developed  a  new  scheduling  algorithm  for  a 
multicomputer  connected  in  a  point-to-point  fashion. 

The algorithm is both distributed, in that every processor runs the same
algorithm, and stable, in that collective load balancing decisions will not
cause unnecessary overloading of a processor in the network. The migration
algorithm assumes that any process may run equally well on any machine, that
the cost of migration is proportional to the memory requirements of the
process, and that main memory is not a limiting constraint on any machine.
Simulations were performed in Simpas [2] to test the algorithm.

The first part of the algorithm is to estimate the current load independently
on each machine, so that adjacent machines with different loads can exchange a
process. Load is estimated as the time it would take an average new process to
complete on the machine, given that each process currently on the machine
executes for a further time equal to its execution time so far. Theoretical
justification for this measure can be found in the report [1].
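
As a rough illustration of this load measure, the sketch below assumes ideal
processor sharing among the resident processes; the report does not say which
local scheduling discipline is used, so this is an assumption of the example.
The name `estimated_load` and the parameter `mean_service` are hypothetical.

```python
def estimated_load(ages, mean_service):
    """Estimated completion time of an average new process on one machine.

    ages         -- CPU time used so far by each resident process; each one is
                    assumed to need that much more time (as stated in the text)
    mean_service -- CPU demand of an "average" new process (hypothetical input)

    Assumption of this sketch: ideal processor sharing, so every runnable
    process advances at the same rate.
    """
    # While the newcomer accumulates mean_service units of CPU time, every
    # resident process receives the same amount, capped by what it still needs.
    return mean_service + sum(min(age, mean_service) for age in ages)


if __name__ == "__main__":
    # Two resident processes that have run 1 and 5 time units so far.
    print(estimated_load([1.0, 5.0], 2.0))
```

Under this assumption an idle machine reports a load equal to the mean service
time, and each long-running resident process adds at most one further mean
service time to the estimate.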

The second part of the algorithm is to find a neighbor that has a different
load. Several versions of the pairing algorithm [3] were evaluated. This
algorithm dynamically creates mutually agreeable pairings between adjacent
machines to allow them to exchange a process.


The  third  part  of  the  algorithm  is  to  determine  which  processes,  if  any, 
should  be  sent.  Each  process  on  the  heavier-loaded  machine  is  considered  a 
potential  candidate  for  migration.  Its  completion  time  is  estimated  both  for 
keeping  it  where  it  is  and  migrating  it.  That  process  with  the  best  improvement 
in  service  ratio  is  migrated,  and  then  selection  repeats  to  find  other  processes 
that  might  profitably  migrate  as  well. 

Simulation results showed that even when the load of the network was fairly
uniform, the migration algorithm significantly reduced the variance of the load
and increased throughput. When the network was unbalanced, a condition
maintained by directing all new arrivals to the same small subset of machines,
migration was able to convert an unstable situation into a stable one.

2.2.  Processor  Interconnection  Strategies 

For  several  years,  the  principal  investigators  have  been  investigating 
graph-theoretical  questions  concerning  optimal  topologies  for  interconnecting 
large  numbers  of  processing  elements  [4,5].  Recently  Will  Leland,  a  graduate 
research  assistant  supported  by  the  project,  completed  doctoral  research  in 
this area under the direction of Professor Solomon.

A large network of processing elements connected by point-to-point
communications lines may be modeled by an undirected graph in which the vertices
represent processing elements and the edges represent communications lines.
Several figures of merit are possible for evaluating such a graph. If each
processor is to be directly connected to every other, the graph must be $K_n$,
the complete graph on n vertices. However, in such a graph, as the number n of
vertices increases, the number of communications lines increases as $n^2$, and
the degree (number of neighbors of each vertex) increases as n. In many
applications, one or the other of these costs is unacceptable. In particular,
the design of the individual processing elements may limit the number of
neighbors to a constant d, independent of n. We may comply with such a
restriction if we allow communication between processing elements to be
indirect, through intermediate processing elements. In this case, we say the
distance between two vertices is the number of edges (communications lines) in
the shortest path connecting them, and the diameter of the graph is the largest
distance between any two vertices.

Briefly stated, then, the problem is to find a graph with the smallest possible
diameter k for a given number n of vertices and a given bound d on the degree.
For many years, an easily derived upper bound has been known for the number N
of vertices in a graph with diameter k and degree d:

    $N \le 1 + \frac{d}{d-2}\left[(d-1)^k - 1\right]$.

This bound can be derived from a very simple argument, the details of which
make it seem likely that the bound is unduly optimistic. It has been proved that
for $d \ge 3$ at most three graphs actually attain the bound (so in all other
cases the bound is too large by at least 1), and for degree 3, the bound is too
large by at least 2 (for almost all diameters). Aside from these rather
disappointing results, no better bound has ever been discovered.
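
The bound is easy to tabulate. The sketch below counts vertices level by level
from an arbitrary root (1 root, then d neighbors, then at most d-1 new vertices
per vertex at each further level), which is the simple argument behind the
closed form; the function name `moore_bound` is ours, not the report's.

```python
def moore_bound(d, k):
    """Upper bound on the number of vertices of a graph whose maximum degree
    is d and whose diameter is k.

    Counts a root, its d neighbors, and at most (d - 1) new vertices per
    vertex on each later level: 1 + d * sum_{i=0}^{k-1} (d - 1)**i.
    For d >= 3 this equals 1 + d * ((d - 1)**k - 1) / (d - 2).
    """
    if d < 1 or k < 0:
        raise ValueError("need d >= 1 and k >= 0")
    return 1 + d * sum((d - 1) ** i for i in range(k))


if __name__ == "__main__":
    for d in (3, 4):
        print(d, [moore_bound(d, k) for k in range(1, 6)])
```

For degree 3 and diameter 2 the bound is 10, which the Petersen graph attains.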

This bound can be restated as an approximate lower bound on diameter:
$k \gtrsim \log N / \log(d-1)$. Previous researchers have considered graphs in
which the diameter varies as some power of the number of vertices. Others have
proposed families of graphs, such as the hypercube, that require an increase of
both d and k to increase N. The best previously announced infinite families of
graphs for small fixed degrees have diameter 2 lg N for degree 3 (a complete
binary tree achieves this bound) and lg N for degree 4 (achieved by the
de Bruijn graph), where "lg" denotes the base-2 logarithm. Solomon and Leland
discovered a new family of degree-3 graphs with diameter 1.5 lg N. Leland was
able to extend these ideas, improving the constant slightly and applying the
ideas to larger diameters.
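
To make the degree-4 figure concrete, the sketch below builds the undirected
binary de Bruijn graph on N = 2^m vertices and measures its diameter by
breadth-first search. The construction used here (shift left and append a bit,
with self-loops dropped) is the standard one and is an assumption about which
variant the report has in mind; all function names are ours.

```python
from collections import deque


def de_bruijn_graph(m):
    """Undirected binary de Bruijn graph on 2**m vertices, as adjacency sets."""
    n = 1 << m
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for bit in (0, 1):
            w = ((v << 1) & (n - 1)) | bit  # shift left, append one bit
            if w != v:                      # drop self-loops (all 0s, all 1s)
                adj[v].add(w)
                adj[w].add(v)
    return adj


def eccentricity(adj, source):
    """Largest shortest-path distance from source, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    if len(dist) != len(adj):
        raise ValueError("graph is disconnected")
    return max(dist.values())


def diameter(adj):
    """Largest distance between any two vertices."""
    return max(eccentricity(adj, v) for v in adj)


if __name__ == "__main__":
    for m in range(2, 8):
        g = de_bruijn_graph(m)
        print(len(g), max(len(nbrs) for nbrs in g.values()), diameter(g))
```

Each vertex has at most two successors and two predecessors, so the degree
never exceeds 4 while the number of vertices doubles with every unit of
diameter.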

The preceding paragraphs consider the asymptotic behavior of the diameter as
the number of vertices approaches infinity. Another approach is to
attempt to find large graphs for certain specific values of k and d. Using
various techniques, including computerized heuristic search, Leland was able to
find many new large graphs with diameter and degree less than or equal to 10.
His results were reported in [6].

Diameter  is  not  the  only  figure  of  merit  for  an  interconnection  topology. 
Another  important  consideration  is  connectivity.  Multiple  paths  between  pairs 
of  nodes  are  desirable,  both  to  eliminate  traffic  bottlenecks,  and  to  improve 
resistance  to  node  failures.  Leland  considered  the  problem  of  connectivity  from 
several  points  of  view.  He  showed  that  the  connectivity  of  his  new  family  of 
graphs  (in  the  graph-theoretical  sense  of  minimal  number  of  vertices  that  must 
be removed to partition the graph into two or more disconnected parts) was very
good. He analyzed the expected distribution of traffic among vertices under the
assumption  that  the  source  and  destination  of  a  message  were  chosen  randomly 
from a uniform distribution. Finally, he defined a new quantity, called the
average random failset size (ARFS), which is the expected size of a
randomly-chosen set of vertices sufficient to partition the graph, and studied
this quantity for various families of graphs.

2.3.  Distributed  Algorithms 

Our work in distributed algorithms has addressed a general and important
problem: How much speedup is possible with n machines running an algorithm
that employs messages as the means of cooperation between machines? A
recent PhD thesis by Jack Fishburn, under the supervision of Raphael Finkel [7],
sheds light on the question by presenting and analyzing several parallel
algorithms for multicomputers.


First, two distributed algorithms for alpha-beta search are presented for trees
of processors. Each processor is an independent computer with its own memory
and is connected by communication lines to each of its nearest neighbors. The
first algorithm, Tree Splitting, was implemented on Arachne and its behavior was
measured. A formal analytical model was developed that predicts best-case,
average, and worst-case speedup. In the worst case, speedup is $\sqrt{n}$ for n
processors; in the best case, speedup is n. The second algorithm, Mandatory Work
First, was also analyzed. Its behavior is bounded by fairly complex expressions,
but for reasonable assumptions about the problem, speedup is about $n^{0.8}$.
Related publications include [8, 7, 9, 10, 11, 12].

Next, numerical algorithms that exhibit a locally-defined iterative character
were investigated. Such algorithms include solution of the Dirichlet problem.
Natural multicomputer adaptations can be built for these algorithms. The
thesis investigates two interconnection topologies, a grid and a tree. The
results can be generalized to situations involving arbitrarily shaped domains of
any dimensionality. Formal analysis derives the running time of the grid and
tree algorithms with respect to per-message overhead, per-point communication
time, and per-point computation time. The overall result is that the larger the
problem, the closer the algorithms approach optimal speedup.
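
A small serial sketch shows what "locally-defined iterative" means here. It
uses Jacobi iteration for Laplace's equation on a rectangular grid with fixed
boundary values; the choice of Jacobi iteration and the function names are
assumptions of this example, since the report does not name the iterative
method used in the thesis.

```python
def jacobi_step(grid):
    """One sweep: each interior point becomes the mean of its four neighbors.
    Boundary rows and columns hold the Dirichlet data and are left unchanged."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def solve_dirichlet(grid, tol=1e-9, max_sweeps=100_000):
    """Repeat sweeps until no point changes by more than tol."""
    for _ in range(max_sweeps):
        new = jacobi_step(grid)
        change = max(abs(a - b)
                     for old_row, new_row in zip(grid, new)
                     for a, b in zip(old_row, new_row))
        grid = new
        if change < tol:
            break
    return grid


if __name__ == "__main__":
    g = [[1.0] * 5 for _ in range(5)]
    for i in range(1, 4):
        for j in range(1, 4):
            g[i][j] = 0.0          # arbitrary interior starting guess
    for row in solve_dirichlet(g):
        print(["%.4f" % x for x in row])
```

Because each point depends only on its four neighbors, a processor that owns a
block of the grid needs only the border values of adjacent blocks at each
sweep, which is where the per-message and per-point communication costs enter.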

Finally, the thesis considers adaptations of large-network algorithms to
small networks. A large-network algorithm requires n machines for a problem of
size n. The thesis presents a general method for transforming large-network
algorithms into quotient-network algorithms that solve problems of size n with
fewer processors. The transformation allows algorithms to be designed assuming
any number of processors. Implementing such algorithms on a quotient network
results in no loss of efficiency, and often a great savings in hardware cost.
This work has been published [13].
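
The flavor of the idea can be shown with a toy example. The sketch below is
not the thesis's general transformation, which this report does not spell
out; it assumes a ring of n virtual processors performing one synchronous
neighbor-exchange step, and then emulates the same step on p physical
processors that each own a block of n/p virtual ones. All names are
hypothetical.

```python
def ring_step_large(values):
    """One synchronous step with one (virtual) processor per value: each
    processor adds its two ring neighbors' values to its own."""
    n = len(values)
    return [values[(i - 1) % n] + values[i] + values[(i + 1) % n]
            for i in range(n)]


def ring_step_quotient(values, p):
    """The same step emulated by p physical processors, each owning a block
    of n/p consecutive virtual processors."""
    n = len(values)
    if p < 1 or n % p:
        raise ValueError("p must divide the problem size")
    size = n // p
    blocks = [values[k * size:(k + 1) * size] for k in range(p)]
    result = []
    for k, block in enumerate(blocks):
        # Only these two border values cross a physical communication line.
        left = blocks[(k - 1) % p][-1]
        right = blocks[(k + 1) % p][0]
        padded = [left] + block + [right]
        result.extend(padded[i - 1] + padded[i] + padded[i + 1]
                      for i in range(1, size + 1))
    return result


if __name__ == "__main__":
    data = list(range(12))
    print(ring_step_large(data) == ring_step_quotient(data, 3))
```

In this toy case each physical processor does n/p units of computation but
exchanges only two values per step, so the ratio of computation to
communication improves as the blocks grow.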


3.  OPERATING  SYSTEMS  IMPLEMENTATION
3.1.  History

The current project had its origins in several equipment grants from NSF
(#MCS77-08988, #MCS78-06809, #MCS79-07516, and #MCS80-06499) for research
in the application of distributed computer systems. With these funds we
purchased five Digital Equipment LSI-11/03 computers and a larger PDP-11/40
software development host. We also acquired enough point-to-point
communications interfaces to connect the 5 small machines to each other and to
the development host. We developed the Roscoe operating system [14, 15, 16, 17],
later called Arachne, for these machines. Roscoe was the first operational
multicomputer operating system written explicitly for a local network. This
project was supported in part by the Graduate Research Foundation of the
University of Wisconsin, which provided summer support for the principal
investigators and salaries for graduate research assistants.

More recently, NSF funds have been applied to upgrade both the processors
and the interconnection hardware. We currently have 8 Digital Equipment
PDP-11/23 computers connected using a communications interface called the
Megalink, manufactured by Computrol Corporation of Danbury, Connecticut.
These computers are not only about 2.5 times as fast as the previous ones, but
also have memory management, allowing us to build significantly larger
programs. The Megalink is a 1-megabit/second broadband contention bus with
carrier-sense but not collision-detection capabilities.

Most recently, the Computer Sciences Department has acquired a Digital
Equipment VAX-11/780, which has begun to serve as our development host.
Programs are cross-compiled on the VAX and are then loaded through the Megalink
into the PDP-11 computers.




3.2.  Enhancements  to  Arachne 

We  have  brought  Arachne  to  a  stable  state  from  which  we  can  see  how  the 
hardware  influenced  our  design  choices  and  how  they,  in  turn,  led  to  the 
behavior  of  the  entire  operating  system  [19].  Locally-designed  ROM  bootstrap 
programs  were  developed  and  placed  in  each  machine  to  simplify  initial  loading. 
We revised Arachne to use the new memory-management facility. This change
was harder than it first appeared, since the ability to access all memory
directly was assumed in many places in the kernel. One immediate advantage was
that more
and  larger  programs  could  fit  into  each  machine,  since  there  is  more  physical 
memory  and  processes  can  share  code.  Another  advantage  is  that  processes 
and  the  kernel  are  protected  from  each  other.  However,  Arachne  does  not  do 
swapping,  so  no  more  processes  can  run  on  one  machine  than  will  fit  in  physical 
memory  at  once,  and  stacks  and  data  areas  cannot  grow  once  a  process  has 
started. 

With  the  move  to  a  contention  communications  medium  we  had  to  deal  with 
collisions.  In  one  preliminary  experiment,  one  processor  "listens"  and  reports 
what  it  hears,  while  the  remaining  seven  processors  attempt  to  send  it  packets 
as  fast  as  they  can.  Each  sits  in  a  tight  loop  waiting  for  the  carrier  to  drop  and 
then  immediately  sends  a  4096-byte  packet.  We  find  that  the  "listener"  hears 
about  100  packets  at  a  time  without  errors,  followed  by  a  brief  burst  of  packet 
collisions. Because the senders are all executing the same algorithm on the
same hardware, there is a high degree of coupling, so that two senders who
collide once will often collide several times in a row. In this sense, the
experiment is a "worst-case" test. It seems to indicate that the absence of
collision detection is not as severe as might be expected, since even under
worst-case conditions, collisions are rare.

Nonetheless, since collisions are possible, we had to develop protocols. We
divide our protocol architecture into two levels: intermachine and interprocess.
The intermachine-level protocol provides reliable, in-order delivery of frames
from one machine to another, avoiding recopying of data where possible, and
allowing for recovery when frames fail to be delivered because of temporary
contention on the line, unavailability of the destination computer, or
unwillingness of a recipient to receive the message. We use a simple
alternating-bit stop-and-wait protocol because end-to-end delay is very short.
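
The behavior of an alternating-bit stop-and-wait protocol can be sketched in a
few lines. The simulation below is not the Arachne implementation; it assumes
that collisions and other failures show up simply as independently lost data
frames or acknowledgements, and that the sender retransmits after a timeout.
The function name and parameters are hypothetical.

```python
import random


def stop_and_wait(messages, loss_rate=0.3, seed=0, max_tries=1000):
    """Deliver messages in order over a lossy link using a one-bit sequence
    number. Returns (delivered messages, number of data frames sent)."""
    rng = random.Random(seed)
    delivered = []
    send_bit = 0       # sender: bit attached to the frame in flight
    expected_bit = 0   # receiver: bit of the next new frame
    transmissions = 0
    for payload in messages:
        for _ in range(max_tries):
            transmissions += 1
            if rng.random() < loss_rate:
                continue          # data frame lost; sender times out, resends
            if send_bit == expected_bit:
                delivered.append(payload)   # new frame: accept it
                expected_bit ^= 1
            # Otherwise it is a duplicate: discard it but still acknowledge.
            if rng.random() < loss_rate:
                continue          # acknowledgement lost; sender resends
            send_bit ^= 1         # acknowledgement received: next frame
            break
        else:
            raise RuntimeError("link appears to be down")
    return delivered, transmissions


if __name__ == "__main__":
    got, sent = stop_and_wait(["a", "b", "c", "d"])
    print(got, sent)
```

Only one frame is outstanding at a time, so a single bit is enough to tell a
retransmission from a new frame; the cost is one round trip per frame, which
is acceptable when end-to-end delay is short.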


The interprocess-level protocol multiplexes several links on one intermachine
connection. We debugged these protocols by simulating them in the
Simpas event-driven simulation language [2]. This technique of debugging
protocols was very natural; the protocols themselves are structured as actions
triggered by asynchronous events.


3.3.  Charlotte 

While we were adapting Arachne to the new hardware, we began
designing its successor, Charlotte, which builds on lessons learned from the
implementation of Arachne. Charlotte differs from Arachne in several respects.

One difference is the intended use of the operating system. Arachne was a pilot
project to test our ideas for building multicomputer operating systems;
Charlotte is intended as a production operating system to be used by a wide
research community. A real user community tests more realistically the
strengths and weaknesses of any software. To attract such a community Charlotte
needs a variety of useful utility programs, including text editors, compilers,
and command interpreters.

Another difference lies in the basic communications structure. Interprocess
communication in Arachne is based on the notion of a link, which is a simplex
communications path from one process to another. A client holding a link
to a server from which it desires service must create a link to itself and
enclose it in a request, so that the server can respond. We found in Arachne
that when two processes need to communicate they usually need a duplex
communications path, even if the bulk of the information transfer is in one
direction. Therefore, we decided to introduce the concept of a duplex link in
Charlotte. Another motivation for duplex links is our intention to experiment
with the use of process migration. A problem with simplex links as implemented
in Arachne is that no information about the source end of the link is available
at the destination end. This implementation interferes with process migration,
since it is impossible for the operating system kernel of the destination
process to find all potential senders to inform them about the move.

One of the design goals of Arachne was to take as much function as possible
out of the kernel and put it into utility processes, which are treated like
ordinary processes by the kernel, but which provide "systems" services such as
file and other resource management. An important utility process in Arachne was
the Resource Manager, which is responsible for allocating the processor
resource. The kernel is responsible for providing the necessary implementation
primitives for allocating memory and starting processes, while the Resource
Manager is responsible for policy decisions such as which processor to start a
new process on. Charlotte carries this trend further. The functions of the
Resource Manager will be divided among a Starter process that initiates new
processes, a Switchboard process that provides a directory service, registering
and dispensing links to other utility processes, and a Connector process, which
helps set up links with a multi-process applications program. These ideas will
be described more fully in a forthcoming report.

A final difference between Arachne and Charlotte is implementation
language. We were both pleased and disappointed with C as the implementation
language for Arachne. On the positive side, it allowed us to be considerably
more productive than we could have been had we tried to implement Arachne in
assembler language, while allowing sufficient flexibility to permit us to avoid
use of assembler language almost entirely. On the negative side, the C compiler
provides almost no assistance in catching runtime errors. Indeed, the design of
the C language, with its heavy reliance on pointer variables, makes such things
as subscript and pointer checking nearly impossible. We found that an undue
amount of our time was spent tracking down errors due to subscript violations,
dangling pointers, and incorrect numbers of arguments to procedures. These
errors are particularly hard to find, since they may not manifest themselves
until long after they occur, and since they may cause unrelated, thoroughly
debugged routines to malfunction. As a result of these experiences, we have
decided to use Modula [20] as the implementation language for Charlotte. We
have a locally-written compiler for Modula that is capable of producing code for
a variety of target machines including the PDP-11 and the VAX. Having our own
locally written and locally maintained compiler allows us to modify the language
to meet our particular needs and may allow experimentation with
systems-implementation language design in the future.

During the first year of this project, the design of Charlotte was nearly
completed and implementation is nearly ready to commence.

Abstract

We present and analyze several practical parallel algorithms
for multicomputers.

Chapter four presents two distributed algorithms for
implementing alpha-beta search on a tree of processors.
Each processor is an independent computer with its own
memory and is connected by communication lines to each of
its nearest neighbors. Measurements of the first
algorithm's performance on the Arachne distributed operating
system are presented. For each algorithm, a theoretical
model is developed that predicts speedup with arbitrarily
many processors.

Chapter five shows how locally-defined iterative
methods give rise to natural multicomputer algorithms. We
consider two interconnection topologies, the grid and the
tree. Each processor (or terminal processor in the case of
a tree multicomputer) engages in serial computation on its
region and communicates border values to its neighbors when
those values become available. As a focus for our
investigation we consider the numerical solution of elliptic
partial differential equations. We concentrate on the
Dirichlet problem for Laplace's equation on a square region,
but our results can be generalized to situations involving
arbitrarily shaped domains (of any number of dimensions)
and elliptic equations with variable coefficients. Our
analysis derives the running time of the grid and the tree
algorithms with respect to per-message overhead, per-point
communication time, and per-point computation time. The
overall result is that the larger the problem, the closer
the algorithms approach optimal speedup. We also show how
to apply the tree algorithms to non-uniform regions.

A large-network algorithm solves a problem of size N
on a network of N processors. Chapter six presents a general
method for transforming large-network algorithms into
quotient-network algorithms, which solve problems of size N
on networks with fewer processors. This transformation
allows algorithms to be designed assuming any number of
processing elements. The implementation of such algorithms on
a quotient network results in no loss of efficiency, and
often a great savings in hardware cost.

by communications lines. In addition, all leaf processors
have access to a large common memory that specifies a queue
of tasks for each processor. These queues are managed by a

diagonal are all zero), these methods yield a speedup of
approximately p/m with p processors,




ordered  best-first,  Baudet's  method 




The interior algorithm interiorαβ runs on interior
nodes of the processor tree. When interiorαβ is activated
it generates all successors of the position to be evaluated
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analysis  of  the  serial  algorithm  under  conditions  of  random 
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Note  that  r  la  an  eigenvalue  of  Md  f  if  and  only  if  1/r  is  Then  the  sequence  a^/1,  Sj/2,  a3/3,  ...  either  diverges  to 

a  root  of  V(a) .  Since  C,(a)  ia  a  quotient  of  polynoaiala,  -  co  or  converges. 


tree architecture. We will choose a fairly straightforward


is insignificant under best-first order


4.9 says that if the marginal return


Fig.  4.4.  Serial  vs.  Parallel  Machines. 


important  problem  in  engineering  and  physics,  the 


1 


multicomputer architecture is the following: Arrange the
processors in a q by q grid, with each processor connected

units of time. The computation phase takes p units of
time to compute p^2 points on each processor. We have
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the  rectangular  region  into  two  square  regions  of  side  p2 


Fig. 5.4. Non-overlap of send and receive.

region into n" regions. Thus each terminal processor is
assigned a region with pM points. Using methods similar to
those in the two-dimensional case, we can calculate aq, the
finishing time for one step, as
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From  the  slave's  point  of  view,  a  time  step  starts  the  master  must  wait  a  certain  amount  of  time  x.  At  time 

when  border  points  start  arriving  from  its  master.  At  time  x,  the  master  starts  sending  points  to  the  first  slave 
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synchronous  tree  algorithm  and  the  semi-synchronous  method  given  efficiency  e.  As  the  desired  efficiency  increases  to 

all  give  nearly  n-fold  speedup  on  large  problemsi  The  one  we  must  put  a  larger  and  larger  subproblem  on  each 
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Arachne  *  These  sample  figures  for  p  and  An  entire  tree  machine  need  not  be  devoted  to  a  single 

necessarily  representative  of  real  distributed  problem.  For  example,  a  tree  of  height  3  and  fanout  4  can 


points  from  a  slave  to  its  master  should  include  the 
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Fig.  5.12.  Load  ■  bivariate  nornal  distribution  with  stan' 
dard  deviation  a)  100)  b)  4)  c)  3 >  d)  2.5. 
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Fig.  6.1.  4-PE  4-pin  shuffle  eaulating  8-PE  4-pin  shuffle 
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the  standard  binary  shuffle-exchange  multicomputer  graph 
in parallel with the general definitions. The general
definitions assume we have been given an arbitrary base
b > 1 and some number N > 1 that is a multiple of b. In


actly  n  stages  away  f root  the  destination)  Itom  hers 


Bach 


atructions  provide  •  staple  routing  algor  1 the 


that  addresses  with  equal  digit  subs  are  connected  by  sent  Is  set,  setting  x0  -  0  determines  the  remaining  xt 


L. 


The trivalent graphs just constructed with


all  within  distance  *  n  -  1.  There  are 


Intuitively, the first term in the DE expression sets

further improvements in density; for comparison, we


Table  2.1  Chapter 


[Rotated table content; not recoverable from the scan.]

complete  hypercube  variable  0.742  lg^H 


u is 1 0 and v is 1*01, or vice versa. Condition c is

results and the homomorphism h to show that the
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(10)*  and  v  la  (01)*,  or  vice  veraa. 
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not  handle.  Thus,  all  but  0(N/lg  N)  edges  In  the  even 
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Abstract 

This paper describes a new scheduling algorithm for a multicomputer
connected in a point-to-point fashion. The algorithm is both
distributed (every processor runs the same algorithm) and stable
(collective load balancing decisions will not cause unnecessary
overloading of a processor in the network). We assume that process
execution times are not known in advance and that inter-processor
transfer times are non-trivial. Load is balanced by migrating suitable
jobs after they have been run long enough to obtain an estimate of
their required service time. The algorithm is suitable for job
scheduling in a distributed time-sharing system implemented on a
multicomputer. Performance of the algorithm is investigated through
simulation.


1. Introduction

Classical operating systems make job scheduling decisions either by
using some a priori estimate of the resources that each job will
require (usually supplied by the user) or by collecting statistics
during the execution of each job to predict its overall resource
requirements. In interactive systems, initial estimates are generally
unavailable, so gathered statistics are used instead.

Recently, several researchers have been designing and implementing
distributed operating systems for multicomputers [1,2,3]. (We
distinguish a multicomputer, in which processors communicate only by
sending messages, from a multiprocessor, in which processors share
memory.) These operating systems assign newly arrived jobs to
processors based on fairly simple heuristics (such as availability of
memory or explicit request of the parent job). In more complex
situations, linear programming or network flow algorithms have been
proposed as methods of determining optimal, static assignments of jobs
to processors given that the resource needs of all jobs are known in
advance.

Dynamic assignment of jobs to processors has not received as much
attention, in spite of the fact that it can increase throughput and
decrease response time by taking advantage of loading differences
between processors. However, even dynamic job assignment is not a
complete solution to the problem of scheduling interactive tasks on a
multicomputer. Since properties of a job are not known until its
behavior has been sampled by letting it run, a job must be able to
start on one machine and then migrate to a more suitable machine as
its resource requirements become evident.

Migration is by no means an easy task. Not only must the code and data
be moved, but also all logical communications paths must be rerouted.
Independent of processor load questions, communication lines may also
suffer from overload and may create bottlenecks, and processes do not
necessarily function equally well on any machine of the multicomputer.
These aspects of migration will be ignored in this paper. We will
concentrate on migration purely for the purpose of balancing processor
load.

The algorithms that we present are both distributed and stable. A
distributed algorithm is important to ensure robustness, to prevent
information bottlenecks, and because a centralized scheduling
algorithm is antithetical to the concept of a distributed computer
system. A scheduling algorithm is stable if all job assignment
decisions are correct for at least the short term; an algorithm would
be unstable if a job continually migrates around the network without
accomplishing any useful work. This form of fruitless migration is the
multicomputer analog of thrashing on a virtual memory system; we call
it processor thrashing.

Processor thrashing can occur on a multicomputer because scheduling
decisions made by a processor are based on relatively old data (due to
transmission delay between processors) and are made independently of
decisions made by other processors. Thus if processors A and B both
observe that processor C is idle, they can offload so much work to
processor C that it is now overloaded in relation to processors A and
B. Jobs then can be shunted back to processors A and B where the
entire cycle can begin again.

Bidding [10] is the best-known example of a distributed scheduling
algorithm. However, bidding requires that the communications medium of
the multicomputer allow broadcast messages and that each processor
maintain a current "price" for resource use. This price is used to
answer "requests for quotes" sent by other processors. Adjusting this
price in relation to resource use is not straightforward [11]. Our
algorithm requires neither of these restrictions. Migration decisions
are based on minimizing the estimated response time for individual
jobs; if it is estimated that a job would have a quicker response on a
nearby processor (the estimate includes transfer time delay) then it
is sent to that processor to be run.

To clarify the scope of our discussion, we state the following working
assumptions:

1.  All processors have the same speed and are equally capable of
    running any job. In particular, we ignore the cost of
    communication with such entities as files, which may differ as
    jobs move from sites close to the files to sites farther away.

2.  The runnable set of processes is scheduled independently on each
    processor according to a round-robin algorithm.

3.  A process may be migrated from any processor to any of its
    neighbors at any time. The communications cost of this transfer is
    proportional to the size of the job (the amount of memory it is
    using) plus communications-network queueing delay.

4.  Process migration is never constrained by memory size. We assume
    either that each processor has ample memory, or that backing
    storage can be used at each processor to hold those jobs that
    cannot fit into main memory.

5.  The scheduling algorithm that decides when and how to transfer
    processes requires no overhead. However, messages to achieve
    cooperation between processors do entail time overhead.

6.  The service time for each process is defined as the amount of time
    that process spends in the running state from its start until its
    termination. The service time is not known by the scheduling
    algorithm, although it may keep historical information that allows
    it to estimate the job service time distribution.

7.  The processors are interconnected by two-way point-to-point
    communications lines, forming a graph with processors as vertices
    and communications lines as edges. This graph is not necessarily
    complete, but it is connected.

8.  New jobs can arrive at any machine on the network.

The results reported here are preliminary and based on simulation; we
intend to implement the most successful variation on the Arachne
operating system to validate these simulations. The algorithms we will
discuss can be decomposed into two parts, which we call the load
estimation method and the cooperation method. The following sections
deal with these methods in detail; we then discuss the combined
algorithm and simulation results.

2.  Load  estimation 

In a single-processor operating system, the remaining service-time
estimate for any job is commonly based on the heuristic that if a job
has run a long time, it is likely to run some more, but an actual
remaining processing-time estimate is not made. Instead, the job is
placed on a queue shared by all jobs of a similar processing-time
history. For example, multilevel feedback schedulers place a job on
queue k if it has run for k quanta. Jobs on queues with higher k have
lower priority. However, in a distributed system, an explicit estimate
of the remaining processing time needed by a job is required so that
its response time may be estimated under the assumption that it is
moved to a different machine in the network.

2.1 Estimating Remaining Processing Time

We consider four algorithms for estimating how much longer a job will
continue based only on the processing time it has used to date. The
simplest is the memoryless method, which assumes that all jobs have
the same expected remaining time, independent of time used so far.
Another simple algorithm is pastrepeats; it estimates that remaining
time is equal to time used so far. If we know the distribution of
service times, the associated distribution method estimates that the
job's remaining processing time is the expected remaining time
conditioned by the time already used. For purposes of comparison, the
optimal method (which is infeasible in practice) gives the actual
remaining service-time requirement.

If we let t be the amount of time used so far by a job, R(t) be the
remaining time needed given that t seconds have been used, S represent
the service-time random variable (with density f(s), distribution
F(s), expected value E[S]), and $R_E(t)$ the scheduler's estimate of
R(t), then these methods estimate R(t) by the following expressions:


    memoryless:    $R_E(t) = E[S]$   (independent of t)
    pastrepeats:   $R_E(t) = t$
    distribution:  $R_E(t) = E[S - t \mid S > t]$
    optimal:       $R_E(t) = R(t)$

As a special case, the memoryless method is the distribution method
for the exponential distribution. The first step in comparing these
methods is the following theorem:

Theorem. If the service distribution is known, then the distribution
method has minimal RMS error when averaged over the lifetimes of all
jobs.

Proof. Given that S = s, the mean square error of an estimation method
over the lifetime of a particular job is

$$E[\text{error} \mid S = s] = \int_0^s \bigl(R_E(t) - (s - t)\bigr)^2 \, dt.$$

Thus the overall mean square error is

$$E[\text{error}] = \int_0^\infty E[\text{error} \mid S = s] \, f(s) \, ds.$$

Substituting the expression for $E[\text{error} \mid S = s]$ into the
second equation above, reversing the order of integration, and
expanding the squared term yields:

$$E[\text{error}] = \int_0^\infty (1 - F(t)) \Bigl[ R_E(t)^2 - 2 R_E(t) \, E[S - t \mid S > t] + E[(S - t)^2 \mid S > t] \Bigr] \, dt.$$

If for every t we select $R_E(t)$ so as to minimize the expression in
square brackets, it is clear that we will minimize the integral.
Application of elementary calculus shows that setting

$$R_E(t) = E[S - t \mid S > t]$$

will minimize the integrand for each t. QED.

Even though this value of $R_E(t)$ minimizes the integral, the size of
the expression

$$\int_0^\infty (1 - F(t)) \, E[(S - t)^2 \mid S > t] \, dt$$

sets a lower bound for E[error]. Since this term is proportional to
the variance of S, there may be considerable error in the estimate
even in this best case.

Furthermore, the distribution method is difficult to apply in
practice. In most situations, the service-time distribution is not
known and must itself be estimated. Without an analytic formula for
f(s), $R_E(t)$ must be evaluated numerically from the empirical
service-time density, at considerable expense in both time and space.
We will return to the problem of selecting a good estimation method
for remaining service time after discussing how such a method can lead
to estimates of response time.
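
The four estimators can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in the simulations. The function names and the representation of the empirical distribution as a plain list of observed service times are assumptions made here, as is the choice to return 0 when no observed job ran longer than t.

```python
def memoryless(t, mean_service):
    """R_E(t) = E[S]: ignores the time t already used."""
    return mean_service


def pastrepeats(t):
    """R_E(t) = t: remaining time is estimated as the time used so far."""
    return t


def distribution_empirical(t, samples):
    """R_E(t) = E[S - t | S > t], estimated from observed service times.

    `samples` is a hypothetical list of completed-job service times that
    stands in for the empirical service-time density.
    """
    tail = [s - t for s in samples if s > t]
    if not tail:
        # Assumption: with no observed job longer than t, estimate zero.
        return 0.0
    return sum(tail) / len(tail)


def optimal(t, true_service):
    """R_E(t) = R(t): requires the true service time, so it is infeasible."""
    return true_service - t
```

The empirical version scans every stored sample on each call, which matches the remark above that numerical evaluation is costly in both time and space.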


2.2  Estimating  Response  Time 


The primary performance measure for any interactive computer system is
response time. It is therefore natural for our scheduling policy to
attempt to minimize overall response time. We have assumed that jobs
on each processor are scheduled according to a round-robin discipline;
for computational simplicity we now assume that the quantum size is
small enough that the behavior of the round-robin scheduler can be
modeled by processor sharing [12].


Let J(P) be the set of jobs present at processor P, and let k (not in
J(P)) be a potential migrant to processor P whose response time we
wish to estimate should it move to P. We always know the used
processor time $t_i$ for each job i; this figure could come from
updating each job's used processor time at the end of each quantum of
service devoted to it. Then RSP(k,J(P)), the estimated response time
of job k at processor P, can be calculated according to the following
algorithm:


Algorithm ESTRESPONSE
    { calculate an estimate RSP(k,S) of the response time for job k
      if it were located at a processor with the set of jobs S }

    R := R_E(t_k);
    for all j in S do
    begin
        if R_E(t_j) < R_E(t_k)
            then R := R + R_E(t_j)
            else R := R + R_E(t_k);
    end;
    RSP(k,S) := R;

To understand why ESTRESPONSE provides a good estimate of the job's
response time, let us suppose that the estimation method is optimal.
Then we claim that RSP(k,J(P)) is the true response time of job k
provided that no new jobs arrive at processor P during the execution
of job k. All jobs j in J(P) with $R_E(t_j) \ge R_E(t_k)$ will
terminate after job k, so they will be present in the system
throughout job k's execution. Therefore if we reduce the remaining
execution times of all such jobs to that of job k we will not modify
job k's response time in any way. The total length of the modified
schedule is the same as the response time of job k. Since the
scheduling policy is work-conserving, the length of the schedule is
the same under FCFS as it is under PS. The algorithm correctly
calculates the length of this schedule.

For the purposes of job migration we will want to extend the algorithm
ESTRESPONSE to handle the case where k is in S. The extension skips
the iteration of the for loop in the case j = k.
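
As a rough illustration, ESTRESPONSE with the j = k extension might be written in Python as follows. The function name `est_response` and the dictionary `remaining`, which maps each job to its estimated remaining time R_E(t_j), are assumptions of this sketch rather than names from the paper.

```python
def est_response(k, jobs, remaining):
    """Estimate RSP(k, S) under processor sharing.

    k         -- the job whose response time is wanted
    jobs      -- the set S of jobs at the processor (may or may not contain k)
    remaining -- hypothetical mapping job -> estimated remaining time R_E(t_job)
    """
    r_k = remaining[k]
    total = r_k
    for j in jobs:
        if j == k:
            continue  # extension: skip the iteration when j = k
        # A job that would outlast k delays it only for k's own remaining time.
        total += min(remaining[j], r_k)
    return total
```

For example, with remaining times of 4 for job k, 1 for job a, and 10 for job b, the estimate is 4 + 1 + 4 = 9. This agrees with a direct processor-sharing calculation: a finishes at time 3, after which k and b share the processor and k finishes its last 3 units at time 9.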

RSP(k,J(P)) estimates the response time of job k at processor P given
that job k has arrived at processor P. The total response time of job
k at processor P is RSP(k,J(P)) plus the time required to transfer job
k to processor P. We define TOTRSP(k,P,Q) as the total response time
of job k at processor Q given that job k is presently at processor P.
Formally,


    TOTRSP(k,P,Q) = TRANSFER(k,P,Q) + RSP(k,J(Q))

where TRANSFER(k,P,Q) is the time required to transfer job k from P to
Q.

We have assumed that migrations are "local" in the sense that a job
will only be migrated to a neighboring processor, so P and Q must be
immediate neighbors (i.e., there is a direct communications link from
P to Q). TRANSFER(k,P,Q) can therefore be calculated from the speed of
the communications device and the sum of the sizes of job k and all of
the other migrants in the communications queue from processor P to Q.
Since this queue presumably resides in P's main memory, calculation of
TRANSFER(k,P,Q) is straightforward.

2.3 A Practical Estimation Method

As mentioned above, while the distribution method of calculating
$R_E(t)$ is optimal in a certain sense, it is difficult to implement
in the absence of an analytical formula for f(s). We therefore
conducted some preliminary simulations to estimate the accuracy of
RSP(k,J(P)) when the estimation methods memoryless, distribution, and
pastrepeats were used. We used an M/G/1-PS queueing system as the test
case. Our accuracy measure was the time-averaged root-mean-square
error between RSP(k,J(P)) under the test estimation methods and
RSP(k,J(P)) based on the optimal method. Observations of the error
were taken for each job in the system at each job departure instant.
We summarize these statistics in Table I.

    Estimation   |        Service Time Distribution
    Method       |    uniform         hyperexponential
                 |  low     high       low     high
    -------------+------------------------------------
    memoryless   |  18.4    41.8       33.9    47.1
    distribution |   6.8    14.4       33.8    55.7
    pastrepeats  |  12.5    23.6       36.7    50.2

                        Table I
                RMS Error in RSP(k,J(P))

The service-time distributions we used were uniform(0.5,6.5) and a
two-stage hyperexponential with mean 3.5 and CV^2 (squared coefficient
of variation = VAR(S)/E[S]^2) of 3.0. Two values for the arrival rate
were used to provide statistics at low and high loadings of the system
(utilization = 0.70 and 0.91 respectively).


We see from the table that in comparison to the other methods, the
memoryless method performs well for the hyperexponential distribution,
but poorly for the uniform distribution. This result can be attributed
to the approximate exponentiality of the hyperexponential
distribution. Because the memoryless method is dependent on the form
of the service-time distribution, it seems unsuitable for our use. The
distribution method has the best overall accuracy, but the pastrepeats
method is not much worse. Because of the ease of implementation of
pastrepeats compared to the distribution method, we use pastrepeats in
our distributed scheduling policy.

2.4 A Total Load Estimate

For the purposes of the cooperation policy as described below, it will
be convenient to have a single number that represents the total load
at a particular processor. The total number of jobs present at a
processor is unsuitable for such an estimate since the true load could
vary widely depending on the remaining service times for those jobs.
On the other hand, the sum of all remaining service times does not
provide an accurate picture without including the number of jobs.

To  illustrate  these  problems,  suppose 
processor  A  has  one  job  with  a  true 
remaining  service  time  of  200  seconds, 
while  processor  B  has  100  jobs  each  with  a 
true  remaining  service  time  of  1  second. 
Then  the  response  time  of  a  new  one-second 
job  sent  to  processor  A  would  be  two 
seconds,  while  if  the  job  were  sent  to 
processor  B  its  response  time  would  be  101 
seconds  (assuming  no  other  new  arrivals). 
On  the  other  hand,  the  response  time  of  a 
new  200-second  job  at  processor  A  would  be 
400  seconds,  while  at  processor  B  it  would 
be  300  seconds. 

As a composite measure of these effects we define the load at
processor P as RSP(k,J(P)) where k is a job whose remaining service
time is equal to the average overall service time. All processors in
the network use the same value for this average. For example, if the
average service-time requirement is 20 seconds, then the load on the
one-job processor mentioned above is 40 seconds; the load on the
100-job processor is 120 seconds. Since the expected response time is
always equal to or greater than the average service time, we subtract
the average service time to yield the expected delay measure.
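
The load and expected delay measures can be checked against the numbers in the text with a short Python sketch. The function names `load` and `expected_delay` and the list-of-remaining-times argument are assumptions made for this illustration.

```python
def load(remaining_times, avg_service):
    """Load at a processor: RSP(k, J(P)) for a hypothetical job k whose
    remaining service time equals the network-wide average service time.

    remaining_times -- estimated remaining times of the jobs at the processor
    avg_service     -- average overall service time (same value everywhere)
    """
    return avg_service + sum(min(r, avg_service) for r in remaining_times)


def expected_delay(remaining_times, avg_service):
    """Load minus the average service time."""
    return load(remaining_times, avg_service) - avg_service
```

With an average service time of 20 seconds, `load([200], 20)` gives 40 and `load([1] * 100, 20)` gives 120, matching the one-job and 100-job processors described above. An idle processor has an expected delay of zero.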

The expected delay may be used to decide if any job at all should be
migrated between two adjacent processors. If they have very different
expected delays, some job on the more heavily loaded processor should
be chosen for migration.

3. Cooperation Policy

Processors must communicate in order to reach decisions on migration.
This communication may be structured in various ways. At one extreme,
each machine can periodically broadcast its load estimate. This method
leads to very large overhead, since each machine periodically hears
messages from every other machine. It is also difficult to match
processors together, but if processors are not matched, then one very
underloaded machine can suddenly receive migrant jobs from many
sources, turning it into a very overloaded machine. Soon this
processor will try to send some of those jobs elsewhere, leading to
processor thrashing.

Since  we  restrict  migration  to  follow 
the  point-to-point  communications  lines  of 
the  computer  network,  unlimited  broadcast 
may  be  restricted  to  direct-neighbor 
broadcast.  The  problem  of  processor 
thrashing  is  still  present,  as  is  the 
large  number  of  messages  needed. 

An elegant means of structuring communication is to build temporary
pairings between processors. One algorithm has been given by Finkel
for constructing permanent pairs. A modified version of the algorithm
has this outline: Each processor asks some randomly chosen direct
neighbor if it will pair. While awaiting an answer, the querier
rejects any queries from other neighbors. If it receives a rejection,
it again picks a random neighbor and tries again. If it receives a
query from its own intended mate, then a pair is formed. The pair is
broken at the mutual agreement of the two mates, for example, after a
job has been migrated. It is not necessary that both break from the
pair simultaneously. During the time that a pair is in force, both
mates reject other queries.

Bernstein has proposed a related algorithm for pairing [15]. He
assigns identification numbers to each processor and allows processors
to remember who has queried them. In one simple variant, a processor
that would otherwise reject a query will instead postpone a decision
if the querier has a larger identification number than its own. In
addition, the identification number of the querier must be higher than
the identification number of any other querier that has been
postponed; this rule prevents cycles of postponement. If there is a
current postponed query and a new query is postponed instead, the
original querier gets a rejection. At the time the postponer either
gets a rejection itself from its intended mate or breaks from a pair,
it may immediately pair with the postponed querier (by sending it a
query).
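
The postponement rule can be written as a small decision function. This is a sketch under assumptions: the function name, the tuple it returns, and the use of `None` for "no postponed querier" are all invented here for illustration, and the function covers only the case of a processor that would otherwise reject the query.

```python
from typing import Optional, Tuple


def handle_query_while_busy(
    own_id: int, querier_id: int, postponed_id: Optional[int]
) -> Tuple[str, Optional[int], Optional[int]]:
    """Decide what a busy processor does with an incoming query.

    Returns (action, new_postponed_id, id_to_reject), where id_to_reject
    is the processor (if any) that should now be sent a rejection.
    """
    higher_than_self = querier_id > own_id
    higher_than_postponed = postponed_id is None or querier_id > postponed_id
    if higher_than_self and higher_than_postponed:
        # Postpone the new querier; any previously postponed querier
        # is displaced and receives a rejection.
        return ("postpone", querier_id, postponed_id)
    # Otherwise reject the new querier and keep the current postponement.
    return ("reject", postponed_id, querier_id)
```

Because a postponed querier always has a larger identification number than the postponer, a chain of postponements has strictly increasing numbers and cannot close into a cycle.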

Simulations were run to compare Bernstein's identification number
method with the other method, which we will call the simple method.
All tests were performed on a square array (5x5 processors) connected
in a mesh (each internal processor has four direct neighbors). The
processors were given identification numbers in row-major order. The
results may be summarized as follows:


1.  Both methods are fairly insensitive to the message delay
    distribution, be it exponential, Erlangian, or uniform (with a
    small window) about the mean. We therefore chose a uniform message
    delay distribution for computational simplicity.

2.  If pairs last time p, average message passing time is m, and
    average waiting time for achieving a pair is w, then when p >> m,
    the identification-number method yields w = p/2. The simple method
    takes up to 80% longer (and gets up to 80% fewer pairs per
    second), with the difference greatest when p is small with respect
    to m.

The purpose of migration is to disperse load evenly across the
network. We tested the ability of the simple and identification number
methods to disperse load by placing 100 tokens on some processor. When
processors pair, tokens are moved between them to equalize the count
on the two processors. Such a pairing lasts a time equal to the number
of tokens transferred. The tokens therefore model migrating jobs. A
measure of the current dispersion is the variance of the number of
tokens across all processors; this figure reaches 0 after enough time
has elapsed.
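
The token experiment rests on two small operations, equalizing a pair and measuring dispersion, which might look like the following in Python. The function names are assumptions of this sketch, as are two details the text does not specify: when the combined count is odd the heavier processor keeps the extra token, and dispersion is computed as the population variance.

```python
def equalize(tokens, a, b):
    """Equalize token counts on paired processors a and b in place.

    Returns the number of tokens transferred, which the experiment
    uses as the duration of the pairing.
    """
    total = tokens[a] + tokens[b]
    hi, lo = (a, b) if tokens[a] >= tokens[b] else (b, a)
    # Assumption: on an odd total the heavier processor keeps the extra token.
    moved = tokens[hi] - (total + 1) // 2
    tokens[hi] -= moved
    tokens[lo] += moved
    return moved


def dispersion(tokens):
    """Population variance of the token counts across all processors."""
    mean = sum(tokens) / len(tokens)
    return sum((x - mean) ** 2 for x in tokens) / len(tokens)
```

With 100 tokens on one of 25 processors the initial dispersion is 384, and it reaches 0 when every processor holds 4 tokens.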

Depending on which processor is given the 100 tokens, the
identification number method either performed significantly better or
significantly worse than the simple method. The best performance was
achieved when processor 1 got the initial tokens; worst performance
was elicited when processor 25 got them. This asymmetry is due to the
fact that a low-numbered processor usually has a postponed querier at
the time it breaks a pair or gets a rejection; a high-numbered
processor usually does not have such a postponement. Therefore,
lower-numbered processors form pairs more quickly and are able to
disperse their tokens more efficiently.

In order to remove the asymmetry, postponements were decided on the
basis not of identification number, but rather preferring to postpone
the querier whose number of tokens is most different from the
postponer's. This method, which we call the absolute difference
method, yielded superior dispersion. However, it is possible for this
method to enter a deadlock, although this situation was never
observed. A deadlock may be formed of a cycle of four processors, two
of which have 0 tokens, and the other two have one token. If each has
queried its clockwise neighbor, every query will be postponed.


To ensure absence of deadlock, it is
sufficient to include a timeout on
postponement, so that any processor that
has postponed a query for more than a
given amount of time sends a rejection
instead. An alternative is to use a signed
difference method, preferring to postpone
the querier whose number of tokens is
least (and less than the postponer's token
count). This method yields very good
dispersion if the number of tokens placed
on the first processor is positive, and
adequate (but inferior) dispersion if the
initial number is negative. Intuitively,
this method quickly removes jobs from
overloaded processors but is not so quick
to add them to underloaded ones.

Figure  1  (placed  at  the  end  of  the 
paper)  presents  a  graph  of  dispersion 
versus simulated time for the simple,
identification  number,  absolute,  and 
signed  methods.  On  the  basis  of  these 
data  we  chose  the  signed  method  for  our 
distributed  load-balancing  algorithm. 

4.  The  load-balancing  algorithm 

The  cooperation  policy  may  be  com¬ 
bined  with  the  load  estimation  policy  to 
create  an  overall  load-balancing  algo¬ 
rithm.  Current  load  estimates  are  used  in 
the  signed  pairing  algorithm  to  form  pairs 
that  differ  greatly  in  load.  Once  a  pair 
is  formed,  the  more  heavily  loaded  proces¬ 
sor  decides  which  jobs,  if  any,  to  send  to 
the  other  processor,  based  on  greatest  ex¬ 
pected  improvement  in  the  response  ratio 
of  the  migrant  jobs.  A  processor  only 
tries  to  find  a  mate  if  it  has  at  least 
two  jobs;  otherwise,  migration  from  this 
processor  is  never  reasonable.  However, 
every  processor  is  willing  to  respond 
favorably  to  a  query. 

Several  variants  of  this  algorithm 
have  been  tested  by  simulation.  One 
result  we  quickly  found  was  that  fruitless 
pairings  and  attempts  were  very  frequent. 
A  pairing  is  fruitless  if  no  job  can  be 
migrated.  An  attempt  is  fruitless  if  a 
query  is  rejected.  These  fruitless  ac¬ 
tions  would  also  cause  overhead  in  an  ac¬ 
tual  implementation.  To  reduce  this  over¬ 
head,  we  introduced  a  relaxation  period 
after  a  processor  has  queried  all  of  its 
neighbors.  During  this  period,  no  new  at¬ 
tempts  at  pairing  are  made.  Arrival  of  a 
new  job  or  a  query  terminates  relaxation 
early.  We  also  changed  the  pairing  algo¬ 
rithm  so  that  only  the  first  neighbor 
queried  is  chosen  at  random;  when  a  reply 
is  received,  the  next  neighbor  in  turn  is 
asked  until  all  neighbors  have  been 
queried.  This  variant  of  the  pairing  al¬ 
gorithm  is  a  distributed  analog  of  polling 
in  a  centralized  system. 

We also found that in order to produce
a reasonable number of migrations,
the  time  a  processor  remains  paired  must 
be  as  short  as  possible  and  multiple  mi¬ 
grants  per  pairing  must  be  allowed.  Pro¬ 
cessors  cannot  remain  paired  during  the 
entire  period  they  are  exchanging  jobs, 
and  the  actual  migration  of  jobs  must  be 
as  rapid  as  possible.  In  order  to  accom¬ 
plish  these  goals,  we  assume  that  each 
processor  maintains  two  job  lists.  One 
list  contains  the  jobs  presently  residing 
at  the  processor;  the  other  contains  jobs 
being  sent  to  the  processor  that  have  not 
yet  been  received.  As  jobs  are  received, 
they  are  removed  from  the  second  list  and 
placed  on  the  first. 

Every  query  sent  from  the  processor 
includes  both  job  lists;  the  time  consumed 
by  each  job  is  also  included.  For  the 
purpose  of  calculating  the  load  or 
response  time  at  a  processor,  the  set  J(P) 
is  taken  as  the  union  of  the  two  lists. 

The  list  of  migrating  jobs  is  updated 
as  follows:  The  sending  processor  always 
decides  which  jobs  to  send  based  on  its 
current  load  and  its  mate's  load  (as  of 
the last query from its mate). The sender
sends  a  list  of  migrant  jobs  before  the 
first  migrant  itself  is  sent.  The  receiv¬ 
ing  processor  remains  paired  just  long 
enough  to  receive  this  list.  A  processor 
can  therefore  never  become  overloaded  by 
receiving  migrants  from  more  than  one 
source,  nor  can  a  group  of  jobs  in  transit 
cause  a  particular  processor  to  become 
overloaded;  the  destination  processor  will 
always  know  about  them  before  they  are 
sent.  In  this  way  the  pairing  algorithm 
allows  stable  distributed  scheduling  deci¬ 
sions  to  be  made. 

Allowing more than one migration per
pairing is straightforward. Suppose that
processors A and B are paired and that A
has the larger load. Then A has two sets
of jobs to consider: J(A) and C, where C
is the copy of J(B) that processor B sent
to A. For each job j in J(A), processor A
calculates the ratio

$$ I_B(j) = \frac{R_A(j, J(A))}{R_B(j, C) + \mathrm{TRANSFER}(j, A, B)} $$

This ratio represents the improvement in
service that job j can receive at
processor B. If $I_B(j) \le 1$ for all j in
J(A), then no migrant is sent; A sends B a
message indicating this fact and the pair
is broken. On the other hand, if
$I_B(j) > 1$ for one or more jobs, the job
with the largest service improvement is
selected as the first migrant. To allow
for the eventual presence of this job at
B, j is removed from J(A) and a copy is
inserted into C. The actual job j is
removed from the ready set and placed in
the communications queue for processor B.
The entire procedure is repeated to
determine if there are other jobs to
migrate. When this algorithm terminates
with $I_B(j) \le 1$ for all j, processor A
formulates a message listing the migrants
to be sent to B, puts the message at the
head of the communications queue, and
starts the transfer. Processor A
immediately breaks its pairing with
processor B and, if it has not yet polled
all of its neighbors, begins sending a
query to its next neighbor in turn.
Otherwise processor A relaxes.
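
The selection loop just described can be sketched as follows. This is a minimal illustration, not the simulated implementation: the response-time estimate `ps_response` is an assumption (time to finish under processor sharing with no further arrivals), the transfer delay is taken to be a constant, and the names `select_migrants`, `jobs_a`, and `jobs_b` are hypothetical.

```python
def ps_response(service, others):
    """Assumed response-time estimate: time for a job needing `service`
    units to finish under processor sharing alongside `others`
    (remaining service times), with no further arrivals."""
    return service + sum(min(service, s) for s in others)

def select_migrants(jobs_a, jobs_b, transfer):
    """Choose the jobs that the more heavily loaded processor A sends to B.

    jobs_a, jobs_b map job names to remaining service time; `transfer`
    is an assumed constant migration delay. Returns names in send order.
    """
    ja = dict(jobs_a)   # J(A)
    c = dict(jobs_b)    # C, the local copy of J(B)
    migrants = []
    while ja:
        def improvement(name):
            s = ja[name]
            here = ps_response(s, [v for k, v in ja.items() if k != name])
            there = ps_response(s, c.values()) + transfer
            return here / there
        best = max(ja, key=improvement)
        if improvement(best) <= 1:
            break                  # no remaining job benefits from moving
        c[best] = ja.pop(best)     # account for the job's eventual presence at B
        migrants.append(best)
    return migrants

print(select_migrants({"a": 1, "b": 1, "c": 1, "d": 1}, {}, 0.1))
```

Because each chosen job is added to C before the next ratio is computed, later decisions already assume that earlier migrants have arrived at B, which is what keeps a single pairing from overloading the receiver.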

5.  The  Distributed  Scheduling  Algorithm 

Before describing the results of our
simulations, we present a concise
description of the algorithm. Each
processor can be in one of three states:
idle, pairing, and migration. A processor
is idle if it has fewer than two jobs or
is relaxing. Processors cooperate by
sending one of three types of messages:
queries, rejections, and migrations.

A processor enters the pairing state
when a second job arrives, when it
receives a query from a neighbor, when its
relaxation period expires, or when it
leaves the migration state and has not yet
queried all of its neighbors. In the
pairing state, processors try to form
pairs by cyclically querying each
neighbor; at the end of each cycle, the
processor relaxes. The first neighbor in
each cycle is chosen at random. Each query
message includes a list of local jobs,
both those present and those in migration.
This information is used to compute the
querier's load for purposes of pairing.

A processor in the idle state
responds to a query with a query, forming
a pair. Processors in the pairing state
respond with a query only if queried by
their intended mate; otherwise, they
respond with a rejection message.
Processors in the migration state respond
with a rejection. Rejections may be
postponed if the querier has a lower load
than the rejector and that load is lower
than the load on a postponed querier. Only
one querier may be postponed at a time; any
other querier gets an immediate rejection.
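
The response rules above amount to a small decision function. The sketch below is illustrative only: the names `State` and `respond` are hypothetical, and we assume that a rejection may be postponed when no querier is currently postponed (the text states the rule only relative to an existing postponed querier). What happens to a displaced postponed querier is left to the caller.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    PAIRING = "pairing"
    MIGRATION = "migration"

def respond(state, querier, intended_mate, querier_load, own_load,
            postponed_load=None):
    """Decide how a processor answers an incoming query.

    Returns "query" (form a pair), "postpone" (hold the rejection),
    or "reject" (immediate rejection).
    """
    if state is State.IDLE:
        return "query"
    if state is State.PAIRING and querier == intended_mate:
        return "query"
    # A rejection is due; it may be postponed for a lightly loaded querier.
    lower_than_me = querier_load < own_load
    # Assumption: with nobody postponed yet, postponement is allowed.
    beats_postponed = postponed_load is None or querier_load < postponed_load
    if lower_than_me and beats_postponed:
        return "postpone"
    return "reject"

print(respond(State.PAIRING, querier=4, intended_mate=3,
              querier_load=1, own_load=5))
```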

Once a pair is formed, the mates
enter the migration state. The member
with the larger load selects jobs to
migrate to the other and sends a list of
those jobs, followed by the jobs
themselves, if any. As soon as this
decision has been made, the sender breaks
the pair and returns to the pairing state
(or the idle state, if it is time to
relax). When the recipient receives the
list, it also breaks the pair and returns
to the pairing state (or the idle state).

The  jobs  to  be  migrated  are  selected 
by  comparing  their  expected  time  to  com¬ 
plete  on  their  current  host  with  the  ex¬ 
pected  time  to  complete  on  its  mate.  Mi¬ 
gration  delay  is  included  in  this  esti¬ 
mate.  The  job  with  the  best  ratio  of  ser¬ 
vice  time  on  the  mate  to  service  time  on 
the  current  host  is  selected  to  be  sent 
first.  Decisions  for  other  jobs  are  based 
on  the  assumption  that  the  first  job  has 
been  received  by  the  mate. 


relax   pair-   termi-    migra-   resp.   msgs
time    ings    nations   tions    ratio   sent

0.01    4961    1960      819      4.09    16760
0.10    4760    1953      836      4.14    15317
0.25    4507    1949      797      4.09    13972
0.50    4138    1948      820      4.33    12928
0.75    3905    1941      828      4.46    12115
1.00    3696    1949      797      4.40    11595
2.00    3307    1928      780      4.70    10889

Table II
Scheduling Algorithm Performance
For Different Relaxation Intervals
Odd Numbered Processors Loaded


6.  Simulation  Results 

These  experiments  were  also  performed 
on  a  square  mesh  of  25  processors.  The 
system  arrival  process  was  assumed  to  be 
Poisson  with  intensity  equal  to  80%  of  25 
times  the  per-processor  service  rate.  The 
total  system  utilization  was  therefore 
0.8.  Various  arrival  patterns  were  tested 
to  see  how  well  the  migration  algorithm 
adapted  to  local  imbalance.  For  example, 
in  one  test  case  all  jobs  were  initially 
assigned  to  an  arbitrary  processor  chosen 
at  random,  while  in  another  test  case  all 
arriving  jobs  were  assigned  to  odd- 
numbered  processors. 


The message delay time was set to
0.10 seconds (a reasonable figure on
Arachne), and the service time
distribution was a two-stage
hyperexponential distribution with
$CV^2 = 3.0$ and a mean of 1.0. (It is well
known that CPU service time distributions
are more skewed than exponential; a
hyperexponential distribution is a
convenient representation of this
fact [16].) Rather than simulate a
round-robin schedule, the simulation
actually implements a processor-sharing
scheduler. The simulations were written in
SIMPAS [17, 18], a simulation extension of
the language PASCAL.

6.1 A Relaxation Time Experiment

In our first experiment, we tested
various settings of the relaxation time to
see how it affected the quality of the
load balancing. We measure this quality
by the average response ratio perceived by
all jobs that terminate. In this test,
all new jobs were assigned to odd-numbered
processors chosen at random according to a
uniform distribution. Table II shows the
statistics for the first 100 seconds of
simulated time.

These results show that the algorithm is
relatively insensitive to the relaxation
interval until the relaxation interval
becomes large in relation to the average
job inter-arrival time. Since changing the
relaxation interval from 0.01 to 0.5
reduces the total number of messages sent
by more than 20%, we have elected to use a
relaxation interval of 0.5 seconds.
Although the number of messages given in
the table may seem excessive, it
represents only about 5 messages per
processor per second.

6.2 Effectiveness of Load Balancing

To  demonstrate  the  effectiveness  of 
load  balancing  in  our  algorithm,  we  dis¬ 
cuss  two  additional  simulation  runs. 

In the first run each arriving job is
assigned at random, according to a uniform
distribution, to one of the 25 processors.
Since the simulation implements a
processor-sharing scheduler and the
arrival process is Poisson, it follows that
without migration the mean number of jobs
at each processor would be the same as for
an M/M/1 queueing system [19] with
utilization 0.8. Thus

$$ \text{mean number of jobs per processor} = \frac{\rho}{1-\rho} $$

$$ \text{mean time in system} = \frac{1}{\mu(1-\rho)} $$

where $\rho = \lambda/\mu$ is the utilization,
$\lambda$ is the per-processor arrival rate,
and $\mu$ is the per-processor service rate.
If there is no migration, the mean number
of jobs per processor should be 4; by
Little's law the mean time in system
should be 4/0.8 = 5 seconds. The response
ratio should therefore be about 5 (since
the average service time is 1 second).
Table III gives simulation results for our
algorithm during each 25 seconds of a 100
second run. (The statistics are reset at
the end of each 25 second interval.)


  time       response    average queue
(seconds)     ratio          size

    25         2.29          1.33
    50         3.29          2.23
    75         3.33          2.50
   100         3.29          2.47

Table III
Migration Results
All Processors Loaded

These  data  show  the  clear  superiority  of 
migration  even  when  the  load  is  very  uni¬ 
form. 

For  our  final  example  we  consider  a 
case  in  which  all  arrivals  to  the  system 
are  initially  assigned  to  one  of  four  pro¬ 
cessors.  Figure  2  shows  the  location  of 
the  loaded  processors  in  the  network,  with 
the  id  numbers  for  the  loaded  processors 
enclosed  in  square  brackets. 

Figure 2

Loaded  Processors  for  4-Processor  Example 

Since the processor interconnections are
along horizontal and vertical grid lines
only, processor 13 is surrounded by
heavily loaded processors. We conjecture
that a less sophisticated scheduling
algorithm would overload processor 13. In
fact, our simulations show that while the
system as a whole is overloaded (the
loaded processors simply cannot offload
jobs fast enough), 168 jobs are sent to
processor 13 during 100 seconds while only
8 jobs are sent back from processor 13 to
one of the loaded processors.

7. Concluding Remarks

We  have  described  a  new  distributed 
load  balancing  algorithm  in  which  proces¬ 
sors  cooperate  in  making  load  balancing 
decisions  but  do  not  depend  on  a  central¬ 
ized  controller.  Although  the  algorithm 
is  complex,  we  believe  that  the  stability 
of  this  algorithm  is  worth  the  complexity. 
The  experience  we  have  gained  in  studying 
this  algorithm  through  simulation  will 
help  us  when  we  install  it  in  the  Arachne 
multicomputer  system. 


We  are  currently  investigating 
simpler  load  migration  schemes  in  order  to 
evaluate  the  importance  of  stability  in 
load  balancing  algorithms  of  this  kind. 
ABSTRACT

We present a distributed algorithm for implementing α-β search on a tree
of processors. Each processor is an independent computer with its own
memory and is connected by communication lines to each of its nearest
neighbors. Measurements of the algorithm's performance on the Arachne
distributed operating system are presented. A theoretical model is developed
that predicts at least order of $k^{1/2}$ speedup with k processors.
1. INTRODUCTION

The α-β search algorithm is central to most programs that play games like chess. It is
now well known [1] that an important component of the playing skill of such programs is the
speed at which the search is conducted. For a given amount of computing time, a faster search
allows the program to "see" farther into the future. In this paper we present and analyze a
parallel adaptation of the α-β algorithm. This adaptation, which we call the tree-splitting
algorithm, speeds up the search of a large tree of potential continuations by dynamically
assigning subtree searches for parallel execution.

In Section 2, we review the α-β algorithm. Section 3 discusses parallel implementations
of the α-β algorithm suggested by other workers. Section 4 formally describes the tree-splitting
algorithm. Section 5 discusses some possible optimizations and variations of the algorithm.
Section 6 presents two sets of performance measurements for this algorithm. One set was
taken through simulation, the other on a network of microprocessors. Section 7 develops a
theoretical model that predicts speedup for an arbitrary number of processors, and Section 8
compares these predictions with the measurements of Section 6.

2.  THE  ALPHA-BETA  ALGORITHM 

We  assume  that  the  reader  is  familiar  with  the  negamax  and  alphabeta  algorithms  as 
described by Knuth and Moore [2]:

function negamax(p : position) : integer;
var m : integer;
    i, d : 1..MAXCHILD;
    succ : array [1..MAXCHILD] of position;
begin
    determine the successor positions succ[1],...,succ[d];
    if d = 0 then { terminal position }
        return(staticvalue(p));
    { find maximum of child values }
    m := -∞;
    for i := 1 to d do
        m := max(m, -negamax(succ[i]));
    return(m)
end;


function alphabeta(p : position; α, β : integer) : integer;
var i, d : 1..MAXCHILD;
    succ : array [1..MAXCHILD] of position;
begin
    determine the successor positions succ[1],...,succ[d];
    if d = 0 then { terminal position }
        return(staticvalue(p));
    for i := 1 to d do
    begin
        α := max(α, -alphabeta(succ[i], -β, -α));
        if α ≥ β then return(α) { cutoff }
    end;
    return(α)
end;

The function alphabeta obeys the following important property: For a given position p,
and for values of α and β such that α < β,

    if negamax(p) ≤ α, then alphabeta(p,α,β) ≤ α
    if negamax(p) ≥ β, then alphabeta(p,α,β) ≥ β
    if α < negamax(p) < β, then alphabeta(p,α,β) = negamax(p).

The first and second cases above are called failing low and failing high respectively. In the
third case, success, alphabeta accurately reports the negamax value of the tree. Success is
assured if α = -∞ and β = +∞. The pair (α,β) is called the window for the search.
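
The three cases can be checked on a small example. The following is a runnable sketch of the two functions, assuming a lookahead tree encoded as nested Python lists whose integer leaves are static values; the encoding and the sample tree are illustrative choices, not part of the original presentation.

```python
import math

def negamax(node):
    """Full negamax value; a leaf is an integer static value."""
    if not isinstance(node, list):
        return node
    return max(-negamax(child) for child in node)

def alphabeta(node, alpha, beta):
    """Alpha-beta search with window (alpha, beta)."""
    if not isinstance(node, list):
        return node
    for child in node:
        alpha = max(alpha, -alphabeta(child, -beta, -alpha))
        if alpha >= beta:
            break  # cutoff
    return alpha

tree = [[3, 5], [2, 9], [6, 1]]          # negamax value is 3

full = alphabeta(tree, -math.inf, math.inf)
low = alphabeta(tree, 5, 10)             # window lies above the true value
high = alphabeta(tree, -10, 2)           # window lies below the true value
inside = alphabeta(tree, 0, 10)          # window contains the true value
print(negamax(tree), full, low, high, inside)
```

Here `low` is at most 5 (failing low), `high` is at least 2 (failing high), and `inside` equals the negamax value (success).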

The  alpha-beta  algorithm  is  strongly  serial:  It  uses  information  from  one  part  of  the  loo¬ 
kahead  tree  to  avoid  work  in  another  part.  If  the  lookahead  tree  is  decomposed  into  several 
pieces  and  these  pieces  searched  simultaneously,  work  that  the  serial  algorithm  avoids  may  be 
performed.  Nevertheless,  we  will  see  that  such  a  decomposition  can  achieve  speedup. 

Since  we  will  later  be  referring  to  a  tree  of  processors,  we  reserve  the  following  notation 
for  nodes  of  lookahead  trees:  A  node  is  often  called  a  position.  A  node's  child  is  its  successor, 
and its parent is its predecessor. If each interior node has n successors, we say that the tree has
degree n. The level of a node or subtree is its distance from the root.

We  define  the  speedup  of  a  parallel  algorithm  over  a  serial  one  to  be  the  time  required  by 
the  serial  algorithm  divided  by  the  time  for  the  parallel  algorithm.  We  will  restrict  our  attention 
to  parallel  computers  built  as  a  tree  of  serial  computers.  A  node  in  this  tree  is  a  processor.  A 
processor's  parent  is  its  master,  and  its  child  is  its  slave.  If  each  interior  processor  has  n  slaves, 
we  say  that  the  tree  has  fanout  n. 

3.  RELATED  WORK 

3.1.  Parallel-Aspiration  Search 

In order to introduce parallelism, Baudet [3] rejects decomposition of the lookahead tree in
favor of a parallel aspiration search, in which all slave processors search the entire lookahead
tree, but with different initial α-β windows. These windows are disjoint, and in the simplest
variant their union covers the range from -∞ to +∞. Since each window is considerably
smaller than (-∞, +∞), each processor can conduct its search more quickly. When the
processor whose window contains the true negamax value of the tree finishes, it reports this value,
and move selection is complete. Baudet analyzes several variants of this algorithm under the
assumption of randomly distributed terminal values, and concludes that the obtainable speedup
is limited by a constant independent of the number of processors available. This maximum is
established to be approximately 5 or 6. Surprisingly, for k equal to 2 or 3, Baudet's method
yields more than k-fold speedup with k processors. Baudet infers that the serial α-β search
algorithm is not optimal, and estimates that a 15 to 25 percent speedup may be gained by
starting the search with a narrow window.

Since  a  narrow  window  does  not  speed  up  a  successful  search  when  moves  are  ordered 
best-first,  Baudet’s  method  yields  no  speedup  under  best-first  move  ordering. 
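
The idea of parallel aspiration search can be illustrated with a sequential sketch in which each window stands for one slave processor. This is not Baudet's implementation: the nested-list tree encoding, the integer-valued leaves, and the helper names `aspiration_windows` and `aspiration_search` are assumptions made for the example. Because windows are open intervals over integers, adjacent windows share a boundary offset of one so that together they cover every integer value exactly once.

```python
import math

def alphabeta(node, alpha, beta):
    """Serial alpha-beta; a leaf is an integer static value."""
    if not isinstance(node, list):
        return node
    for child in node:
        alpha = max(alpha, -alphabeta(child, -beta, -alpha))
        if alpha >= beta:
            break  # cutoff
    return alpha

def aspiration_windows(cuts):
    """Split the integers into disjoint open windows at the given cut points."""
    bounds = [-math.inf] + sorted(cuts) + [math.inf]
    windows = []
    for i in range(len(bounds) - 1):
        lo = bounds[i] if i == 0 else bounds[i] - 1
        windows.append((lo, bounds[i + 1]))
    return windows

def aspiration_search(tree, cuts):
    """Search the whole tree once per window; in the parallel algorithm
    each window would be handled by a different slave processor."""
    for alpha, beta in aspiration_windows(cuts):
        value = alphabeta(tree, alpha, beta)
        if alpha < value < beta:      # this search succeeded
            return value, (alpha, beta)
    return None

tree = [[3, 5], [2, 9], [6, 1]]
print(aspiration_windows([0, 4]))
print(aspiration_search(tree, [0, 4]))
```

Only the search whose window contains the true value succeeds; the others fail high or fail low and their results are discarded.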

3.2.  Mandatory-Work-First  Search 

Akl, Barnard, and Doran [4] report simulation measurements of a parallel tree-decomposing
alpha-beta  algorithm.  This  algorithm  distinguishes  between  those  parts  of  a  subtree  that  must 
be  searched  and  those  parts  whose  need  to  be  searched  is  contingent  upon  search  results  in 
other  parts  of  the  tree.  By  searching  mandatory  nodes  first,  their  algorithm  attempts  to  achieve 
as  many  of  the  cutoffs  seen  in  the  serial  case  as  possible. 

We  have  analyzed  a  parallel  algorithm  based  on  these  ideas,  and  will  report  on  it  else¬ 
where. 

4.  THE  TREE-SPLITTING  ALGORITHM 

We now describe a parallel algorithm for implementing α-β search on a tree of processors.
The  root  processor  evaluates  the  root  position.  Each  interior  processor  evaluates  its  assigned 
position  by  generating  the  successors  and  queuing  them  for  parallel  assignment  to  its  slave  pro¬ 
cessors.  Thus  a  processor  at  level  N  in  the  processor  tree  always  evaluates  positions  at  level  N 
in  the  lookahead  tree.  As  an  interior  processor  receives  responses  from  its  slaves,  it  narrows  its 
window  and  tells  working  slaves  about  the  improved  window.  When  all  successors  have  been 
evaluated  (or  a  cutoff  has  occurred),  the  interior  processor  is  able  to  compute  the  value  of  its 
position. Each leaf processor evaluates its assigned position with the serial α-β algorithm.
When  a  processor  finishes,  it  reports  the  value  computed  to  its  master. 

4.1.  The  Leaf  Algorithm 

The leaf algorithm runs at leaf nodes of the processor tree. We will describe its
interactions with its master by means of remote procedure calls [5]. The algorithm can also be
expressed in a message-passing or shared-memory form. The master calls the function leafαβ (line 19)
remotely. A master can interrupt a search in progress to tell its slave of a newly narrowed
window by invoking the asynchronous "update" procedure in the slave (line 3). The variables α
and β (line 1) are global arrays, not formal parameters, in order to facilitate updating their
values in each recursive call of alphabeta when the new window arrives.


Here  is  the  leaf  algorithm: 

 1  α, β : array [1..MAXDEPTH] of integer;
 2
 3  asynchronous procedure update(newα, newβ : integer);
 4  { update is called asynchronously by my master
 5    to inform me of the new window (newα, newβ) }
 6  var tmp : integer;
 7      k : 1..MAXDEPTH;
 8  begin
 9      for k := 1 to MAXDEPTH do
10      begin { update α, β arrays }
11          α[k] := max(α[k], newα);
12          β[k] := min(β[k], newβ);
13          tmp := newα;
14          newα := -newβ;
15          newβ := -tmp;
16      end
17  end;
18
19  function leafαβ(p : position; α, β : integer) : integer;
20  begin
21      α[1] := α;
22      β[1] := β;
23      return(alphabeta(p, 1));
24  end;
25
26  function alphabeta(p : position; depth : integer) : integer;
27  var succ : array [1..MAXCHILD] of position; { successors }
28      succno : 1..MAXCHILD; { which successor }
29      succlim : 1..MAXCHILD; { how many successors }
30  begin
31      determine the successors succ[1],...,succ[succlim];
32      if succlim = 0 then return(staticvalue(p));
33      for succno := 1 to succlim do
34      begin { evaluate each successor }
35          α[depth+1] := -β[depth];
36          β[depth+1] := -α[depth];
37          α[depth] := max(α[depth], -alphabeta(succ[succno], depth+1));
38          if α[depth] ≥ β[depth] then return(α[depth]); { cutoff occurs }
39      end; { for succno }
40      return(α[depth]);
41  end; { function alphabeta }

4.2.  The  Interior  Algorithm 

The interior algorithm interiorαβ runs on interior nodes of the processor tree. When
interiorαβ is activated, it generates all successors of the position to be evaluated (line 25).
Each of its slaves is requested to evaluate one of these positions; the remaining positions are
queued for later service. This queue is implemented by the parallel for-loop (lines 30 to 42).
A separate process is created (line 30) for each successor, and each process attempts to gain
exclusive control of a slave processor (line 32). When successors outnumber slaves, some
processes must remain blocked within "idle_slave" until slaves can be allocated to them.

Interiorαβ may take various actions when a slave returns. First, if the returned value
causes the current α value to increase, then interiorαβ sends -α as an updated β value to all
active slaves (line 39). Second, if α has been increased so that it becomes greater than or equal
to β, then an α-β cutoff occurs. The nonpositive-width window is sent to all active slaves,
quickly terminating them (line 39). Meanwhile, interiorαβ empties its queue of waiting
successor positions. (In the algorithm shown below, this effect is achieved by the test on line 33.)
Third, if the queue of unevaluated successor positions is non-empty, the reporting slave is
assigned the next position from the queue.

If interiorαβ is interrupted by an update call from its master, it relays this new window to
its slaves (lines 3 to 14).

When all successors have been evaluated, interiorαβ returns the final value to its master
(line 43). In a game situation, the algorithm at the root node might serve as the user interface,
and would remember which move has the maximum value.


Here  is  the  interior  algorithm: 

 1  var glα, glβ : integer; { global α, β }
 2      q : integer; { depth of processor tree }
 3  asynchronous procedure update(newα, newβ : integer);
 4  { update is called asynchronously by my master
 5    to inform me of the new window (newα, newβ) }
 6  begin
 7      atomically do
 8      begin
 9          glα := max(glα, newα);
10          glβ := min(glβ, newβ);
11      end; { atomically do }
12      for all slaveid do
13          slaveid.update(-glβ, -glα);
14  end; { update }
15
16  function interiorαβ(p : position; α, β : integer) : integer;
17  var succ : array [1..MAXCHILD] of position; { successors }
18      succno : 1..MAXCHILD; { which successor }
19      succlim : 1..MAXCHILD; { how many successors }
20      tmp : array [1..MAXCHILD] of integer;
21      function g : integer;
22  begin
23      glα := α;
24      glβ := β;
25      determine the successors succ[1],...,succ[succlim];
26      if succlim = 0 then return(staticvalue(p));
27      if depth(succ[1]) < q then
28          g := interiorαβ
29      else g := leafαβ;
30      parfor succno := 1 to succlim do
31      begin
32          slaveid := idle_slave();
33          if glα < glβ then
34          begin
35              tmp[succno] := -slaveid.g(succ[succno], -glβ, -glα);
36              if tmp[succno] > glα then
37              begin
38                  atomically do glα := max(tmp[succno], glα);
39                  for all slaveid do slaveid.update(-glβ, -glα);
40              end; { if tmp[succno] > glα }
41          end; { if glα < glβ }
42      end; { parfor succno }
43      return(glα);
44  end; { interior }

5.  OPTIMIZATIONS 

Since the tree-splitting algorithm can be optimized in several ways, it should be
considered the simplest variant of a family of tree-decomposing algorithms for α-β search. As a
first optimization, since most of a master's time is spent waiting for messages, that time could
be spent profitably doing subtree searches. However, only the deepest masters could hope to
compete with their slaves in conducting searches. All other masters are by themselves slower
than their slaves because their slaves have slaves below them to help. However, more than
half of all masters control leaf processors, and greater speedup should be achieved by running a
leaf algorithm along with these masters on the same processors. We might expect an additional
1.5-fold speedup from this technique.

A  second  optimization  groups  several  higher-level  masters  onto  a  single  processor.  For 
example,  the  3  highest  processors  in  a  binary  processor  tree  could  be  replaced  by  3  processes 
running  on  a  single  processor. 

Finally, the root processor may send a special α-β window to the slave working on the last
unevaluated successor. This window is (-α-1, -α) instead of the usual (-β, -α). If that
successor is not the best, then the slave's search will fail high as usual, but the minimal window
speeds its search. If that successor is best, then the smaller window causes the search to fail
low, again terminating faster. In either case, the root processor determines which successor is
the best move, even though its value may not be calculated. By speeding the search of the last
successor, the idle time of the other slaves is reduced. (This narrow window given to the root's
last subtree search can also be used in serial α-β search.)

We can generalize this technique in the following way, called alpha raising: Suppose that
each successor of the root is being evaluated by a different slave, and that slave_1's
current α value, α_1, is lower than any other, and that slave_2 has the second lowest α
value, say α_2. Update α_1 to α_2 - 1, speeding up slave_1. If this update causes slave_1's
otherwise successful search to fail low, then the reported value is still lower than all
others, and that move is still discovered to be best.

6.  MEASUREMENTS  OF  THE  ALGORITHM 

Two sets of measurements were taken. The first set was taken on a network of LSI-11
microcomputers running under the Arachne† operating system.6 The second set was taken by
simulation.

The game of checkers was used to generate lookahead trees. Static evaluation was based
on the difference in a combination of material, central board position for kings, and
advancement for men. Moves were ordered best-first according to their static values.
General α-raising was not employed, except for the special case for the last successor. Ten
board positions were chosen for use in these experiments. These positions actually arose
during a human-machine game; they span the entire game. All lookahead trees from these
positions were expanded to a depth of 8.

6.1.  Measurements  on  Arachne 

A single LSI-11 machine searches lookahead trees at a rate of about 100 positions per
second. Inter-machine messages can be sent at a rate of about 70 per second. Only 5
processors were available in Arachne at the time of these experiments, so it was not
possible to use Arachne to test processor trees of height greater than one. Each of the ten
board positions was evaluated with the serial algorithm, and with processor trees of height
one and fanout two and three. For each processor tree of height q, fanout f, and k = f^q
leaf processors, Table 1 gives the minimum, average, maximum, and standard deviation of the
speedups in evaluating the ten lookahead trees.


† We have changed the name of the Roscoe distributed operating system to Arachne, since
Roscoe is a registered trademark of Applied Data Research, Incorporated.


    q    f    k     min     avg     max     std

    1    2    2     1.37    1.81    2.36    0.31
    1    3    3     1.37    2.34    3.15    0.56

Table 1.  Speedup on Arachne.

Surprisingly, more than k-fold speedup was occasionally achieved with k slaves: Three out
of the ten positions were sped up by more than 2 with 2 slaves, and two of those three were
sped up by more than 3 with 3 slaves. In each such case, subtree evaluations finished in a
different order than they were assigned. While one large subtree was being evaluated by one
slave, another smaller subtree was assigned and finished. The large subtree's evaluation then
received a call on "update" that sped it up or even terminated it.

6.2.  Simulation  Measurements 

Binary and ternary processor trees of depth one, two, and three were simulated on the
UNIX† operating system. Processors were simulated with processes, communications lines
with pipes. Within the simulation, the time for evaluation of one terminal position was set
at ten units of time. The time for remotely calling a procedure and for returning from a
remote call were each set at seven units of time. Table 2 gives the minimum, average,
maximum, and standard deviation of the speedups in evaluating the ten lookahead trees.


    q    f    k     min     avg     max     std

    1    2    2     1.21    1.57    2.11    0.28
    1    3    3     1.25    2.04    3.27    0.57
    2    2    4     1.71    2.37    3.33    0.42
    2    3    9     2.13    3.55    6.14    1.00
    3    2    8     1.58    3.12    4.58    0.80
    3    3    27    1.95    5.31    7.95    1.66

Table 2.  Simulated speedup.

Since  most  game-playing  programs  must  make  their  move  within  a  certain  time  limit,  any 
speedup  in  tree  search  ability  will  generally  be  used  to  search  a  deeper  lookahead  tree.  If  we 
have  an  unlimited  supply  of  processors  to  form  into  a  binary  tree,  we  can  obtain  an  unlimited 
speedup  only  if  the  search  is  not  limited  in  time.  Otherwise  we  cannot,  because  we  would 
eventually  violate  our  premise  that  the  lookahead  tree  is  at  least  as  deep  as  the  processor  tree. 
A  new  layer  on  the  processor  tree  does  not  buy  another  full  ply  in  the  lookahead  tree.  For 
example,  several  speedups  of  1.5  would  be  needed  to  search  a  6-times  larger  chess  lookahead 
tree,  or  about  one  additional  ply.  The  depth  of  the  processor  tree  would  grow  faster  than  the 
depth  of  the  tree  it  searches  and  eventually  would  catch  up.  The  only  way  to  avoid  this  limit  is 
to  increase  the  fan-out  of  the  processor  tree.  If  the  fan-out  is  high  enough  that  no  successor 
need  ever  be  queued  for  evaluation  by  a  slave,  then  the  size  of  the  maximum  lookahead  tree 
that can be evaluated within the time limit is limited only by the time required for calls on
interiorαβ to propagate from the root to the leaves. Long before this limitation is reached,
we would run out of silicon for making the processors.
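
The arithmetic behind the chess example can be made explicit. As a rough worked version,
assume (as the example does) that each added layer of a binary processor tree contributes a
speedup factor of about 1.5 and that one more ply multiplies the lookahead tree size by
about 6. The number n of added layers needed to buy one ply then satisfies

$$1.5^{\,n} \ge 6 \quad\Longrightarrow\quad n \ge \frac{\ln 6}{\ln 1.5} \approx 4.42,$$

so under these assumed factors roughly five new processor layers are spent for each
additional ply, which is why the processor tree eventually becomes as deep as the tree it
searches.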


† UNIX is a trademark of Bell Laboratories.


7.  ANALYSIS  OF  SPEEDUP 

We now turn to a formal analysis of the speedup that can be gained in searching large
lookahead trees as the number of available processors grows without bound. For this purpose
we introduce Palphabeta, a simplified version of the tree-splitting algorithm. This
algorithm is in general less efficient than the version already discussed, but is more
amenable to analysis. Much of the analysis in this section is a "parallelization" of results
of Knuth and Moore.2 Indeed, when q = 0, Theorem 1 and Corollary 1 reduce to their results.

As before, the processors will be arranged in a uniform tree. Let f ≥ 1 be the fan-out of
the processor tree (uniform for all interior nodes), and let q > 1 be its depth (uniform for
all leaf nodes). Let q + s be the depth of the lookahead tree, where s > 1. We assume that
the lookahead tree has a uniform degree and that this degree, df, is a multiple of f, where
d ≥ 2. Here is Palphabeta:

 1  function Palphabeta(p: position; α, β: integer): integer;
 2  var i: integer;
 3      function g: integer;
 4  begin
 5      determine the successors p_1, ..., p_df;
 6      if depth(p_1) < q then
 7          g := Palphabeta
 8      else g := alphabeta;
 9      for i := 1 to d do
10      begin
11          α := max(α, max { -g(p_j, -β, -α) : (i-1)f < j ≤ if });
12          if α ≥ β then return (α);
13      end;
14      return (α);
15  end;

The f calls to function g in line 11 are intended to occur in parallel, activating functions
existing on each of the f slaves. Serial α-β search is activated on leaf slaves; Palphabeta
is activated on all others. Unlike the tree-splitting algorithm, Palphabeta waits until all
slaves finish before assigning additional tasks. However, the two algorithms behave
identically when searching either a best-first or worst-first ordered "theoretical" tree of
uniform degree and depth. When we restrict ourselves to one of these lookahead trees, we can
make conclusions about the behavior of the tree-splitting algorithm by studying Palphabeta.
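
The batching behavior of Palphabeta can be sketched in Python. This is only an illustrative
sequential simulation, not the authors' implementation: the f "parallel" calls of each batch
are simply made one after another with the same window, which gives the same values a truly
parallel execution would return. The tree representation (a leaf is a number, an interior
node is a list of successors) and the helper names `negamax` and `random_tree` are
assumptions made for the example.

```python
import math
import random


def alphabeta(tree, alpha, beta):
    """Serial negamax alpha-beta. A leaf is a number; an interior node is a list."""
    if not isinstance(tree, list):
        return tree
    for child in tree:
        alpha = max(alpha, -alphabeta(child, -beta, -alpha))
        if alpha >= beta:
            break  # cutoff: remaining successors are not examined
    return alpha


def palphabeta(tree, alpha, beta, q, f, depth=0):
    """Batched alpha-beta: successors are searched f at a time with one shared window."""
    if not isinstance(tree, list):
        return tree

    def g(child, a, b):
        # Successors at depth < q are handled by interior processors (Palphabeta);
        # successors at depth q are handled by leaf processors (serial alphabeta).
        if depth + 1 < q:
            return palphabeta(child, a, b, q, f, depth + 1)
        return alphabeta(child, a, b)

    for start in range(0, len(tree), f):
        batch = tree[start:start + f]
        # Every member of the batch sees the same (-beta, -alpha) window,
        # as if all f slaves had been started at the same moment.
        results = [-g(child, -beta, -alpha) for child in batch]
        alpha = max(alpha, max(results))
        if alpha >= beta:
            return alpha
    return alpha


def negamax(tree):
    """Exhaustive negamax, used only as a reference value."""
    if not isinstance(tree, list):
        return tree
    return max(-negamax(child) for child in tree)


def random_tree(degree, depth, rng):
    """Uniform tree with random integer terminal values (hypothetical test data)."""
    if depth == 0:
        return rng.randint(-100, 100)
    return [random_tree(degree, depth - 1, rng) for _ in range(degree)]
```

Called with the full window, for example
`palphabeta(tree, -math.inf, math.inf, q=2, f=2)`, the sketch returns the same root value as
exhaustive negamax; only the set of positions visited differs from the serial search.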

7.1.  Worst-first ordering

α-β search produces no cutoffs if, whenever the call alphabeta(p, α, β) is made, the
following relation holds among the successors p_1, ..., p_d:

$$\alpha < -\mathrm{negamax}(p_1) < \cdots < -\mathrm{negamax}(p_d) < \beta.$$

We call this ordering worst first. If no cutoffs occur, it is easy to calculate the time
necessary for Palphabeta to finish. Assume that a processor can generate f successors, send
messages to all of its f slaves, and receive replies in time p. (This figure counts message
overhead time but does not include computation time at the slaves.) Assume also that the
serial α-β algorithm takes time n to search a lookahead tree with n terminal positions. Let
a_n be the time necessary for a processor at distance n from the leaves to evaluate its
assigned position. A leaf processor executes the serial algorithm to depth s. Thus we have
a_0 = (df)^s. An interior processor gives d batches of assignments to its slaves, and each
batch takes time p plus the time for the slave processor to complete its calculation. Thus
we have a_{n+1} = d(p + a_n). The solution to this recurrence relation is

$$a_q = p\,\frac{d^{\,q+1} - d}{d - 1} + d^{\,q+s} f^{\,s},$$

which is the total time for Palphabeta to complete. Since the time for the serial algorithm
to examine the same tree is (df)^{q+s}, the speedup for large s is f^q. There are

$$\frac{f^{\,q+1} - 1}{f - 1}$$

processors, roughly f^q, so when no pruning occurs the parallel algorithm yields speedup
that is roughly equal to the number of processors used.
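
The worst-first recurrence and its closed form are easy to check numerically. The following
is a small sketch, not part of the original analysis; the function names are ours, and it
assumes d ≥ 2 so that the denominator d - 1 is nonzero.

```python
from fractions import Fraction


def worst_first_time(q, s, d, f, p):
    """Iterate a_0 = (df)^s, a_{n+1} = d(p + a_n) up to n = q."""
    a = (d * f) ** s
    for _ in range(q):
        a = d * (p + a)
    return a


def worst_first_closed_form(q, s, d, f, p):
    """Closed form: p (d^(q+1) - d)/(d - 1) + d^(q+s) f^s."""
    return Fraction(p * (d ** (q + 1) - d), d - 1) + d ** (q + s) * f ** s


def worst_first_speedup(q, s, d, f, p):
    """Serial time (df)^(q+s) divided by the parallel completion time."""
    return Fraction((d * f) ** (q + s), worst_first_time(q, s, d, f, p))
```

For example, `float(worst_first_speedup(2, 20, 2, 3, 7))` is already indistinguishable from
f^q = 9 at ordinary precision, because the message-overhead term does not grow with s.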


7.2.  Best-first  ordering 

We  will  now  investigate  what  happens  when  the  lookahead  tree  is  ordered  best-first. 


Definition. We will use the Dewey decimal system to name nodes in both processor trees and
lookahead trees. The root is named by the null string. The j successors of a node whose name
is a_1...a_k are named by a_1...a_k 1 through a_1...a_k j.

Definition. We say that the successors of a position a_1...a_n are in best-first order if

$$\mathrm{negamax}(a_1 \ldots a_n) = -\mathrm{negamax}(a_1 \ldots a_n 1).$$

Definition. We say a position a_1...a_n in the lookahead tree is (q,f)-critical if a_i is
(q,f)-restricted for all even values of i or for all odd values of i. An entry a_i is
(q,f)-restricted if

        1 ≤ i ≤ q and 1 ≤ a_i ≤ f
    or  q < i and a_i = 1.

Theorem 1: Consider a lookahead tree for which the value of the root position is not ±∞ and
for which the successors of every position are in best-first order. The parallel α-β
procedure Palphabeta examines exactly the (q,f)-critical positions of this lookahead tree.

Proof. We will call a (q,f)-critical position a_1...a_n a type 1 position if all the a_i are
(q,f)-restricted; it is of type 2 if a_j is its first entry not (q,f)-restricted and n-j is
even; otherwise (that is, when n-j is odd), it is of type 3. Type 3 nodes have a_n
(q,f)-restricted. The following statements can be established by induction on the depth of
the position p. (Text in brackets refers to positions of depth < q.)

(1) A type 1 position p is examined by calling [P]alphabeta(p, -∞, +∞). If it is not
terminal, its successor position[s] p_1[, p_2, ..., p_f] is [are] of type 1, and
F(p) = -F(p_1) ≠ ±∞. This [These] successor position[s] is [are] examined by calling
[P]alphabeta(p_i, -∞, +∞). The other successor positions p_2, ..., p_df
[p_{f+1}, ..., p_df] are of type 2, and are all examined by calling
[P]alphabeta(p_i, -∞, F(p_1)).

(2) A type 2 position p is examined by calling [P]alphabeta(p, -∞, β) where
-∞ < β ≤ F(p). If it is not terminal, its successor[s] p_1[, p_2, ..., p_f] is [are] of
type 3, and F(p) = -F(p_1). This [These] successor position[s] is [are] examined by calling
[P]alphabeta(p_i, -β, +∞). Since F(p) = -F(p_1) ≥ β, cutoff occurs, and [P]alphabeta does
not examine the other successors p_2, ..., p_df [p_{f+1}, ..., p_df].

(3) A type 3 position p is examined by calling [P]alphabeta(p, α, +∞) where
F(p) ≤ α < +∞. If it is not terminal, each of its successors p_i is of type 2, and they are
all examined by calling [P]alphabeta(p_i, -∞, -α). All of these searches fail high.

It follows by induction on the depth of p that the (q,f)-critical positions, and no others,
are examined.

□

Figure 1 shows the best-first lookahead tree of degree four and depth four that is examined
by alphabeta. Figure 2 shows the best-first lookahead tree of degree four and depth four
that is examined by Palphabeta running on a processor tree of fanout two and depth two.

Corollary 1. If every position on levels 0, 1, ..., q+s-1 of a lookahead tree of depth q+s
satisfying the conditions of Theorem 1 has exactly df successors, for d some fixed constant,
and for f the constant appearing in Palphabeta, then the parallel procedure Palphabeta
(along with alphabeta, which it calls), running on a processor tree of fan-out f and height
q, examines exactly

$$f^{\lfloor q/2 \rfloor} (df)^{\lceil (q+s)/2 \rceil} + f^{\lceil q/2 \rceil} (df)^{\lfloor (q+s)/2 \rfloor} - f^{\,q}$$

terminal positions.

Proof. There are $f^{\lfloor q/2 \rfloor}(df)^{\lceil (q+s)/2 \rceil}$ sequences
a_1...a_{q+s}, with 1 ≤ a_i ≤ df for all i, such that a_i is (q,f)-restricted for all even
values of i. There are $f^{\lceil q/2 \rceil}(df)^{\lfloor (q+s)/2 \rfloor}$ such sequences
with a_i (q,f)-restricted for all odd values of i. We subtract f^q for the sequences in
{1, ..., f}^q 1^s that we counted twice.

□
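
The counting argument can be checked by brute force for small trees. The sketch below is
ours, not the authors'; it enumerates every terminal position name a_1...a_{q+s}, applies
the definition of (q,f)-critical directly, and compares the count with the closed form of
Corollary 1. Function names are assumptions made for the example.

```python
from itertools import product


def is_restricted(i, a_i, q, f):
    """Entry a_i (1-indexed position i) is (q,f)-restricted."""
    return (i <= q and 1 <= a_i <= f) or (i > q and a_i == 1)


def is_critical(seq, q, f):
    """(q,f)-critical: restricted at all even indices or at all odd indices."""
    even_ok = all(is_restricted(i, a, q, f)
                  for i, a in enumerate(seq, start=1) if i % 2 == 0)
    odd_ok = all(is_restricted(i, a, q, f)
                 for i, a in enumerate(seq, start=1) if i % 2 == 1)
    return even_ok or odd_ok


def count_critical_terminals(q, s, d, f):
    """Enumerate all (df)^(q+s) terminal names and count the critical ones."""
    names = product(range(1, d * f + 1), repeat=q + s)
    return sum(1 for seq in names if is_critical(seq, q, f))


def corollary1(q, s, d, f):
    """Closed form: f^floor(q/2) (df)^ceil(n/2) + f^ceil(q/2) (df)^floor(n/2) - f^q."""
    n, df = q + s, d * f
    return (f ** (q // 2) * df ** ((n + 1) // 2)
            + f ** ((q + 1) // 2) * df ** (n // 2)
            - f ** q)
```

For the two trees of degree four and depth four mentioned with Figures 1 and 2, the formula
gives `corollary1(0, 4, 4, 1) == 31` terminal positions for serial alphabeta and
`corollary1(2, 2, 2, 2) == 60` for Palphabeta on a processor tree of fanout two and depth
two.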

Palphabeta reduces to alphabeta when q = 0. Thus Corollary 1 tells us that alphabeta
searches

$$d^{\lceil s/2 \rceil} + d^{\lfloor s/2 \rfloor} - 1$$

terminal nodes when searching a tree of height s and degree d.


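Lemma 1. Given positive constants a, b, c, d, and p, the relations

$$a_0 = a; \qquad a_{n+1} = pd + a_n + (d-1)\,b_n;$$
$$b_0 = b; \qquad b_{n+1} = p + c_n;$$
$$c_0 = c; \qquad c_{n+1} = d\,(p + b_n)$$

are satisfied by the sequences

$$a_n = \begin{cases}
a + h(n)\,[\,d(3p+b+c)+p-b-c\,] - np, & n \text{ even},\\
a + h(n-1)\,[\,d(3p+b+c)+p-b-c\,] - np + d^{(n-1)/2}\,[\,d(p+b)+p-b\,], & n \text{ odd};
\end{cases}$$

$$b_n = \begin{cases}
p + 2p\,g(n) + (p+b)\,d^{\,n/2}, & n \text{ even},\\
p + 2p\,g(n+1) + c\,d^{(n-1)/2}, & n \text{ odd};
\end{cases}$$

$$c_n = \begin{cases}
2p\,g(n+2) + c\,d^{\,n/2}, & n \text{ even},\\
2p\,g(n+1) + (p+b)\,d^{(n+1)/2}, & n \text{ odd},
\end{cases}$$

where the function g is defined by

$$g(n) = \frac{d^{\,n/2} - d}{d - 1},$$

and the function h is defined by

$$h(n) = \frac{d^{\,n/2} - 1}{d - 1}.$$

Proof. Straightforward algebra.

□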
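
Since the proof is left as algebra, a mechanical check may be reassuring. The following
sketch (our own, using exact rational arithmetic) iterates the three recurrences of Lemma 1
and compares the result with the stated closed forms. It assumes d ≥ 2, and g and h are
only ever evaluated at even arguments, as in the lemma.

```python
from fractions import Fraction


def g(n, d):
    """g(n) = (d^(n/2) - d)/(d - 1); n is always even where the lemma uses it."""
    return Fraction(d ** (n // 2) - d, d - 1)


def h(n, d):
    """h(n) = (d^(n/2) - 1)/(d - 1); n is always even where the lemma uses it."""
    return Fraction(d ** (n // 2) - 1, d - 1)


def by_recurrence(n, a, b, c, d, p):
    """Apply the three coupled recurrences n times, starting from (a, b, c)."""
    for _ in range(n):
        a, b, c = p * d + a + (d - 1) * b, p + c, d * (p + b)
    return a, b, c


def by_closed_form(n, a, b, c, d, p):
    """Evaluate the closed forms of Lemma 1 for a_n, b_n, and c_n."""
    bracket = d * (3 * p + b + c) + p - b - c
    if n % 2 == 0:
        a_n = a + h(n, d) * bracket - n * p
        b_n = p + 2 * p * g(n, d) + (p + b) * d ** (n // 2)
        c_n = 2 * p * g(n + 2, d) + c * d ** (n // 2)
    else:
        a_n = (a + h(n - 1, d) * bracket - n * p
               + d ** ((n - 1) // 2) * (d * (p + b) + p - b))
        b_n = p + 2 * p * g(n + 1, d) + c * d ** ((n - 1) // 2)
        c_n = 2 * p * g(n + 1, d) + (p + b) * d ** ((n + 1) // 2)
    return a_n, b_n, c_n
```

The two functions agree for every n and every choice of constants tried, which is the
behavior the lemma predicts.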


Theorem 2: Under the conditions of Corollary 1, and assuming also that (1) serial α-β search
is performed in time equal to the number of leaves visited, and (2) in p units of time, a
processor can generate f successors of a position, send a message to each of its f slaves,
and receive the f replies, the total time for Palphabeta to complete is

(q even:)
$$(df)^{\lceil s/2 \rceil} + (df)^{\lfloor s/2 \rfloor} - 1
+ h(q)\,\bigl[\,d\bigl(3p + (df)^{\lfloor s/2 \rfloor} + (df)^{\lceil s/2 \rceil}\bigr)
+ p - (df)^{\lfloor s/2 \rfloor} - (df)^{\lceil s/2 \rceil}\,\bigr] - pq,$$

(q odd:)
$$(df)^{\lceil s/2 \rceil} + (df)^{\lfloor s/2 \rfloor} - 1
+ h(q-1)\,\bigl[\,d\bigl(3p + (df)^{\lfloor s/2 \rfloor} + (df)^{\lceil s/2 \rceil}\bigr)
+ p - (df)^{\lfloor s/2 \rfloor} - (df)^{\lceil s/2 \rceil}\,\bigr] - pq
+ d^{(q-1)/2}\,\bigl[\,d\bigl(p + (df)^{\lfloor s/2 \rfloor}\bigr) + p - (df)^{\lfloor s/2 \rfloor}\,\bigr].$$

Proof: Let a_n, b_n, and c_n represent the time required for a processor at distance n from
the leaves of the processor tree to search type 1, 2, and 3 positions, respectively. A leaf
processor searching a type 1 position is actually performing the serial algorithm on a tree
of height s and degree df. Hence by Corollary 1,

$$a_0 = (df)^{\lceil s/2 \rceil} + (df)^{\lfloor s/2 \rfloor} - 1.$$

Counting arguments similar to those in Corollary 1 give us

$$b_0 = (df)^{\lfloor s/2 \rfloor}$$

and

$$c_0 = (df)^{\lceil s/2 \rceil}.$$

In order to evaluate a type 1 position, an interior processor at height n+1 in the processor
tree orders its slaves to evaluate d batches of successors. The first batch consists of type
1 positions, and the remaining d-1 batches consist of type 2 positions. Hence

$$a_{n+1} = pd + a_n + (d-1)\,b_n.$$

Similar arguments give us

$$b_{n+1} = p + c_n$$

and

$$c_{n+1} = d\,(p + b_n).$$

By substituting the constant expressions for a_0, b_0, and c_0 to find a_q by the formulas
given by Lemma 1, we obtain the desired formula.

□

Under conditions of best-first search, the parallel α-β algorithm gives order of k^{1/2}
speedup with k processors for searching large lookahead trees. The next theorem formalizes
this result:

Theorem 3: Suppose that Palphabeta runs on a processor tree of depth q > 1 and fan-out
f > 1. Suppose that the lookahead tree to be searched is arranged in best-first order and is
of degree df and depth q+s, where d > 1. Denote by R the time for alphabeta to search this
tree, and by P the time for Palphabeta to search the tree. Then

$$\lim_{s \to \infty} R/P = f^{\,q/2}.$$

Proof: The time for the serial algorithm is

$$(df)^{\lceil (q+s)/2 \rceil} + (df)^{\lfloor (q+s)/2 \rfloor} - 1,$$

from Corollary 1. If we divide this quantity by the expression given by Theorem 2 for P, and
take the limit as s goes to ∞, we obtain the desired result.

□
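
The limit can be observed numerically. The sketch below is ours and is restricted to even
values of q; it computes P by iterating the recurrences from the proof of Theorem 2 rather
than by the closed form, and R from Corollary 1. The function names and the sample value
p = 7 (the message cost used in the simulation) are choices made for the example.

```python
from fractions import Fraction


def serial_time(q, s, d, f):
    """R: terminal positions examined by alphabeta on a best-first tree of depth q+s."""
    n, df = q + s, d * f
    return df ** ((n + 1) // 2) + df ** (n // 2) - 1


def palphabeta_time(q, s, d, f, p):
    """P: iterate a_n, b_n, c_n from the leaf processors (n = 0) up to the root (n = q)."""
    df = d * f
    ceil_term, floor_term = df ** ((s + 1) // 2), df ** (s // 2)
    a, b, c = ceil_term + floor_term - 1, floor_term, ceil_term
    for _ in range(q):
        a, b, c = p * d + a + (d - 1) * b, p + c, d * (p + b)
    return a


def best_first_speedup(q, s, d, f, p):
    """Exact ratio R/P as a rational number."""
    return Fraction(serial_time(q, s, d, f), palphabeta_time(q, s, d, f, p))
```

With q = 2, d = 2, p = 7, and s = 40, `float(best_first_speedup(2, 40, 2, 2, 7))` is within
rounding error of 2 and `float(best_first_speedup(2, 40, 2, 3, 7))` of 3, that is, f^{q/2}
in both cases.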


7.3.  Random Order

We  have  calculated  the  finishing  time  of  the  tree-splitting  algorithm  for  both  best-first  and 
worst-first  ordering  of  terminal  positions.  Another  ordering,  called  random  order,  assumes  that 
terminal  values  are  independent,  identically  distributed  random  variables.  Restated,  this 
assumption  says  that  no  two  terminal  values  are  equal,  and  that  any  one  of  the  n!  orderings  of 
the  terminal  values  is  as  likely  as  any  other. 

We  have  partially  analyzed  a  weaker  form  of  Palphabeta,  called  Pbound,  under  the 
assumption  of  random  order.  This  analysis  derives  the  expected  number  of  terminal  positions 
visited  by  Pbound.  Unfortunately,  it  does  not  yield  the  finishing  time,  since  Pbound  sometimes 
requires  processors  to  be  idle.  The  interested  reader  is  referred  to  the  technical  report.7 

8.  DISCUSSION 

The improvement that alphabeta search shows over negamax search is due to the cutoffs
it achieves. Parallel execution tends to lose some of that advantage, since subtrees that
the serial algorithm would avoid are searched before information is available to cut them
off. This situation is most extreme if the lookahead tree is ordered best-first; in this
case the serial algorithm enjoys the most cutoffs. However, our analysis shows that even in
this case, order of k^{1/2} speedup can still be expected. At the other extreme, if the
lookahead tree is ordered worst-first, then no cutoffs are found in either the serial or the
parallel algorithm. In this case, the parallel algorithm performs no wasted work, and
speedup is order of k.

We can now compare the measurements presented in Section 6 with these theoretical
bounds. Table 3 compares theoretically predicted speedups with measured speedups for
processor trees of height one, two, and three, and of fan-out two and three.


                     theoretically predicted        measured
    q    f    k      worst-first    best-first      Arachne    simulation

    1    2    2      2              1.41            1.81       1.57
    1    3    3      3              1.73            2.34       2.04
    2    2    4      4              2.00            -          2.37
    2    3    9      9              3.00            -          3.55
    3    2    8      8              2.83            -          3.12
    3    3    27     27             5.20            -          5.31

Table 3:  Speedup


In checkers, certain simplifying assumptions used for the analysis are not true. The
lookahead tree is neither regular nor ordered best- (nor worst-) first. Further, the degree
of each interior node is not a fixed multiple of f. Therefore, slave processors do not
finish in unison. Nonetheless, our implementation results with checkers display speedups
that lie between the two analytically derived extremes. These limited results show that the
formal analyses are not unreasonable.
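
The claim that the measured speedups lie between the two analytical extremes can be checked
directly against the figures of Table 3. The sketch below is ours; it takes the worst-first
prediction to be k = f^q and the best-first prediction to be k^{1/2}, and uses the average
measured speedups reported in Tables 1 and 2.

```python
# (q, f, average Arachne speedup or None if not measured, average simulated speedup)
MEASURED = [
    (1, 2, 1.81, 1.57),
    (1, 3, 2.34, 2.04),
    (2, 2, None, 2.37),
    (2, 3, None, 3.55),
    (3, 2, None, 3.12),
    (3, 3, None, 5.31),
]


def predicted(q, f):
    """Return (worst-first, best-first) predicted speedups for k = f^q leaf processors."""
    k = f ** q
    return k, k ** 0.5


def all_between_extremes(rows=MEASURED):
    """True if every available measurement lies between k^(1/2) and k."""
    for q, f, arachne, simulated in rows:
        worst, best = predicted(q, f)
        for value in (arachne, simulated):
            if value is not None and not best <= value <= worst:
                return False
    return True
```

`all_between_extremes()` returns `True` for the tabulated data, although for the largest
processor tree (k = 27) the simulated average of 5.31 sits only slightly above the
best-first prediction of about 5.20.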

9.  CONCLUSIONS 

The α-β algorithm is central to many game-playing programs. Attempts to speed up this
algorithm have usually taken the form of care to order moves in a good approximation to
best-first order and special hardware for static evaluation and move generation.

This paper investigates another line of attack: decomposition of α-β search for parallel
execution on a multicomputer. The tree-splitting decomposition investigated here assigns
different nodes of the lookahead tree to processors in a processor tree. The penalties for
this sort of decomposition are twofold: Communication costs are introduced, and some work is
performed that the serial algorithm avoids. Our implementation measurements and formal
analysis show that these penalties, although present, do not prevent decomposition from
achieving arbitrarily high speedup. Although we do not reach k-fold speedup for k
processors, we expect to achieve at least order of k^{1/2}-fold speedup. The loss of
efficiency is due almost entirely to lost cutoffs; communication overhead is insignificant.

There are other promising decompositions of α-β for parallel execution; in particular, the
"mandatory-work-first" decomposition of Akl et al.4 suggests other dynamic allocations of
unfinished work to processors that may result in even greater speedup.

Abstract

Arachne is a multi-computer operating system running on a
network of LSI-11 computers at the University of Wisconsin. This
document describes the implementation of the Arachne kernel at
the level of detail necessary for a programmer who intends to add
a module or modify the existing code. Companion reports describe
the purposes and concepts underlying the Arachne project, present
the implementation details of the utility processes, and display
Arachne from the point of view of the user program.

buffer is returned. If the priority is 2, then a buffer is only
given if there are at least 1/4 of the original buffers free. If
least 1/2 of the original buffers free. These distinctions are

Abstract

Roscoe is a multi-computer operating system running on a
network of LSI-11 computers at the University of Wisconsin.
Roscoe consists of a kernel program resident on each computer and
several utility processes. This document describes the
implementation of the Roscoe utility processes at the level of detail
necessary for a programmer who intends to add a module or modify
the existing code. Companion reports describe the purposes and
concepts underlying the Roscoe project, present the implementation
details of the kernel, and display Roscoe from the point of
view of the user program.

ROSCOE UTILITY PROCESSES


This paper documents the source code for the following Roscoe
utilities:

     resource manager
     terminal driver
     command interpreter
     file manager
     the "demon" (a PDP-11/40 process with which the file manager
          communicates)
     user-callable library routines
     copyfile (a program used implicitly by the command interpreter)

The reader is assumed to be familiar with the Roscoe User
Guide [Tischler, Solomon, and Finkel 78], which describes the
purposes and use of these utilities. The present paper consists
of a detailed explanation of the programs and data structures for
those who intend to help maintain these utilities. The Roscoe
kernel code is similarly documented [Finkel and Solomon 78].

The documentation given here is accurate as of January 20,
1979. However, recent developments will soon cause some
modifications to the processes discussed here. In particular, a new
utility process called a "pipe" has been introduced to attach the
output of one process to the input of a second one. The command
interpreter and the resource manager will cooperate to establish
piped processes.

Unless otherwise stated, all files mentioned are in the
directory "/usr/network/roscoe/user".

1.  THE RESOURCE MANAGER

1.1  General

The code lies in "resource.u". Programs that communicate
with the resource manager should include "resource.h" unless all
such communication is handled by library routines.

The resource manager may use the service calls "load",
"remove", "startup", and "kill". These calls are meant to be
privileged, although that restriction is not yet enforced. The
"startup" call gives the new process a link to its resource
manager. This link should be of a special kind, although the
resource manager currently refers to it as a "REQUEST" link.
Either a REQUEST or REPLY link may be enclosed over this link.
Currently, the kernel does not enforce the restrictions on
enclosed links. Furthermore, the new process may not destroy this
parent link except by dying. This restriction is enforced.

1.2  Protocols between resource managers

When resource managers talk to each other, they send requests
whose "rmreq" fields hold special values. These values are
defined by macros that begin with the letters "RR". We will follow
the convention of calling the originator of such a message the
"first" resource manager and the recipient the "second".
Together, they are called "colleagues". When client processes talk to
resource managers, they send requests whose "rmreq" fields are
defined by macros that begin with the letters "RM". Following
sections describe how these codes are employed to carry out the
various resource manager functions.

The  routine  "sendrm"  is  used  to  send  messages  between 
resource  managers.  The  array  "rmtab",  of  size  5,  keeps  track  of 
which  other  resource  managers  exist.  Entries  in  "rmtab"  are  link 
numbers;  -1  indicates  that  there  is  no  corresponding  resource 
manager.  Links  used  between  resource  managers  use  channel  2,  and 
the  code  is  always  the  machine  number  of  the  holder. 

Resource managers may make RMFSREQ or RMTTREQ requests of
each other, in which case the request is treated the same as any
other user's request, except that for RMTTREQ, the local terminal
link is assumed to be the one desired. In addition, there are
five other requests peculiar to resource managers, as listed
below. Whenever these requests are made, the first resource
manager does not wait for a reply; any reply that eventually
comes will be self-explanatory.

RRSTART: This request continues an RMSTART request that the
first resource manager could not complete. Everything in the
original request is passed along; no response is needed. The
resource managers pass the request around in order of increasing
machine id. The originator recognizes it should it return. This
circular method is an ad-hoc approach that will be changed in the
future to a more reasonable polling order.

RRKILL: This request continues an RMKILL request if the
process targeted for the kill is not local to the first resource
manager. The request is forwarded to the resource manager on the
proper machine; no response is needed.

RRLINK: This message asks the second resource manager for a
link owned by that second resource manager. The first resource
manager intends to give this link to a third resource manager.

RRINFORM: This message accompanies an enclosed link owned
or held by the first resource manager. (See Section 1.3.)

RRPASS: Used to "pass the ball" when a FOREGROUND process
"with the ball" for a certain terminal has died, and the process
that should next "get the ball" is on another machine. (See
Section 1.8.)

1.3  Resource manager initialization

When  a  resource  manager  is  loaded  by  the  kernel  job  of  the 
Roscoe  kernel,  it  receives  the  machine  number  as  the  argument  to 
"main".  In  particular,  if  the  bit  "NOTPAPA"  is  off,  this 
resource  manager  knows  that  it  is  the  first  one.  We  will  call 
such  a  resource  manager  "original".  A  resource  manager  that  the 
kernel  job  starts  in  an  attempt  to  recover  from  failure  at  some 
node  or  as  a  subsidiary  resource  manager  has  the  bit  NOTPAPA  set. 

When the original resource manager starts, initialization is
done by "initrmO". A file manager and terminal driver are loaded
and started as DETACHED processes; the file manager is loaded
manually, and does not occupy a spot in "imagetab". Input and
output terminal links are opened, a "configuration" is read from
the terminal by "readline", and the input link is closed. The
"configuration" is a character string that the resource manager
scans to determine what other resource managers to load and with
what arguments. For example, the configuration

     1T24FT

indicates that resource managers should be loaded on machines 1,
2, and 4; machine 1 will have its own terminal, machine 4 will
have its own terminal and file system, and machine 2 will have
neither.
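
The way such a configuration string might be scanned can be sketched as follows. This is a
guess at the grammar based solely on the single example "1T24FT" (a machine digit followed
by zero or more of the letters T and F applying to that machine); the actual scanner in the
resource manager may accept other forms, and the function name is hypothetical.

```python
def parse_configuration(config):
    """Map each machine number to the set of flags ("T", "F") that follow it.

    Assumed grammar, inferred from the one documented example: a digit names a
    machine, and any T (terminal) or F (file system) letters that follow apply
    to that machine until the next digit.
    """
    machines = {}
    current = None
    for ch in config:
        if ch.isdigit():
            current = int(ch)
            machines[current] = set()
        elif ch in "TF" and current is not None:
            machines[current].add(ch)
        else:
            raise ValueError(f"unexpected character {ch!r} in configuration")
    return machines
```

Under this reading, `parse_configuration("1T24FT")` yields machine 1 with a terminal,
machine 2 with neither flag, and machine 4 with both a terminal and a file system, matching
the description of the example.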

The  argument  given  to  a  remote  resource  manager  has  the 
machine  number  as  its  lowest  three  bits  and  contains  flags 
RMTTFLAG  and  RMFSFLAG  to  indicate  respectively  whether  a  terminal 
driver  (and  attendant  command  interpreter)  and  file  manager 
should be loaded locally. Also, the bit "NOTPAPA" is set to
indicate that the child resource manager is not the first one
started. The high order byte of the argument gives the machine
number of the parent (papa).
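
The layout of this argument word can be illustrated with a small sketch. The field positions for the machine number (lowest three bits) and the parent (high order byte) come from the text; the bit values chosen for RMTTFLAG, RMFSFLAG, and NOTPAPA and the helper names are assumptions made for the example.

```c
#define MACHMASK 07      /* lowest three bits: machine number (from the text) */
#define RMTTFLAG 010     /* assumed bit value */
#define RMFSFLAG 020     /* assumed bit value */
#define NOTPAPA  040     /* assumed bit value */

/* Build the argument handed to a non-original resource manager. */
int makearg(int machine, int flags, int papa)
{
    return ((papa & 0377) << 8)               /* high order byte: parent's machine */
         | NOTPAPA                            /* child is never the original */
         | (flags & (RMTTFLAG | RMFSFLAG))    /* what to load locally */
         | (machine & MACHMASK);              /* where the child runs */
}

int argmachine(int arg) { return arg & MACHMASK; }
int argpapa(int arg)    { return (arg >> 8) & 0377; }
int isoriginal(int arg) { return (arg & NOTPAPA) == 0; }
```

An original resource manager receives an argument with NOTPAPA off, so "isoriginal" is true for it and false for every argument built by "makearg".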

When  a  resource  manager  other  than  the  original  one  starts, 
it uses the initialization routine "initrms". An entry is made
in "rmtab" for the first resource manager (the owner of this
resource  manager's  link  0),  and  an  RRINFORM  message  is  sent  to 
that  resource  manager  with  an  enclosed  link  having  channel  2. 
The  code  for  this  link  is  the  number  of  the  first  resource 
manager.  The  "rmarg"  field  of  this  initial  message  is  -1.  (The 
discussion  of  RRINFORM  messages  continues  below.)  If  the 
RMFSFLAG  is  on,  a  local  file  manager  is  loaded  by  the  routine 
"loadfs",  which  asks  the  first  resource  manager  for  a  file  system 
link, uses it to perform the load, and then destroys this
unneeded link. If the RMTTFLAG is on, a local terminal driver and
command interpreter are loaded. If these flags are off, the
appropriate links are obtained from the first resource manager by
the same RMTTREQ and RMFSREQ protocols followed by any other
process.

The routines "loadtt" and "loadci" are used to load local
copies of the terminal driver and command interpreter,
respectively. No matter how a terminal is obtained, a terminal output
link is automatically opened. The variable "owntt" tells which
terminal (0-4) the resource manager is using. The command
interpreter is vaccinated against control-C's by setting its "lifeno"
field in "proctab" to -1.

The  routine  "rrinform"  handles  RRINFORM  requests.  If  the 
"rmarg"  field  is  -1,  the  receiver  knows  that  a  new  resource 
manager  just  came  to  life.  The  number  of  this  new  resource 
manager can be found in "inmess.urcode". The receiver then acts as
the "papa" and sends out RRLINK messages to begin the process of
hooking together all the other resource managers. Otherwise, the
high order byte of the "rmarg" field tells the number of the
resource manager that owns the link, and the low order byte tells
for which resource manager the link is intended. If this intended
holder is not the present resource manager, the RRINFORM
request is forwarded to the correct one. Whenever a resource
manager receives a link which it will continue to hold, it
updates its "rmtab" accordingly.
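
The three cases that "rrinform" distinguishes can be sketched as a small classifier. The encoding of "rmarg" (-1 for a newcomer, otherwise owner in the high order byte and intended holder in the low order byte) is taken from the text; the function name "classify" and the enumeration are hypothetical.

```c
#define NEWRM (-1)       /* "rmarg" value announcing a new resource manager */

enum rraction { RR_NEWCOMER, RR_KEEP, RR_FORWARD };

/*
 * Decide what to do with an RRINFORM message received by resource
 * manager "self".  For the KEEP and FORWARD cases, *owner and *holder
 * receive the two bytes of "rmarg".
 */
enum rraction classify(int rmarg, int self, int *owner, int *holder)
{
    if (rmarg == NEWRM)
        return RR_NEWCOMER;             /* act as papa: send out RRLINK messages */
    *owner  = (rmarg >> 8) & 0377;      /* who owns the enclosed link */
    *holder = rmarg & 0377;             /* who is meant to hold it */
    return (*holder == self) ? RR_KEEP : RR_FORWARD;
}
```

In the KEEP case the receiver would record the link in its "rmtab"; in the FORWARD case it would pass the message on to the intended holder.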

The routine "rrlink" handles RRLINK requests which, as
mentioned above, are only sent by the original resource manager to
other resource managers. A link is created on channel 2; the
code is specified by "rmarg", which indicates the resource
manager that will eventually hold the link. The link is sent to
the first resource manager in an RRINFORM message; it will then
forward it to the intended holder, as described above.

1.4  Wall  clock  synchronization 

When  the  original  resource  manager  starts,  a  special  request 
is  sent  to  the  demon  on  the  PDP-11/40  for  the  Unix  date.  The 
variable  "timewarp"  is  used  to  convert  between  Roscoe  time  and 
Unix  time  (the  former  begins  Jan  1  1973  CST;  the  latter  Jan  1 
1970  GMT).  Other  resource  managers  initialize  their  dates  to 
zero,  but  this  value  is  soon  corrected. 

Whenever "sendrm" is used (to send an RRSTART, RRKILL, RRINFORM,
RRLINK, or RRPASS message to a colleague), the current date
is placed in the "update" field of the message. Whenever such a
request  is  received,  the  local  date  is  set  to  the  value  in  the 
"update"  field  if  it  is  later.  This  algorithm  keeps  the  wall 
clocks  in  the  various  Roscoe  kernels  from  losing  time  relative  to 
each  other. 
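
The rule amounts to taking the maximum of the local date and every date received. A minimal sketch, assuming dates are held in a long (as the "update" field is) and using hypothetical routine names:

```c
long localdate = 0;      /* non-original resource managers start at zero */

/* Value to place in the "update" field of an outgoing message. */
long stampdate(void)
{
    return localdate;
}

/* On receipt: adopt the sender's date only if it is later than ours. */
void mergedate(long update)
{
    if (update > localdate)
        localdate = update;
}
```

Because a clock is only ever moved forward, a slow clock is pulled up to the fastest one it hears from, and no clock is ever set back.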

1.5 Process initiation

The  resource  manager  knows  which  client  sent  each  RMSTART  request 
because  the  code  of  the  link  containing  the  request  is  also  the 
index  for  that  process  in  the  resource  manager's  process  table. 
The  resource  manager  can  also  determine  the  client's  associated 
terminal from this table. If the request cannot be processed
locally (either the "load" or "startup" service call fails due to
lack of room), then an RRSTART request is sent to the resource
manager with the next higher machine number (modulo 5), as
determined from "rmtab". The RRSTART request has all the information
of  the  RMSTART  request  (including  the  same  enclosed  link,  if 
any),  plus  the  client's  process  identifier  and  terminal  number, 
which  would  not  otherwise  be  known  to  the  second  resource 
manager.  The  second  resource  manager  tries  likewise  to  initiate 
the  child  process,  and  if  it  also  fails,  sends  the  request  on 
further.  The  identifier  of  the  child  includes  its  machine  number 
as  its  lowest  three  bits;  if  the  RRSTART  returns  to  the  client's 
resource  manager,  it  is  recognized  as  a  failed  request.  It  may 
be  sent  around  once  more  (if  the  original  method  involved  the 
GENTLY  mode)  but,  in  any  case,  the  buck  stops  somewhere,  either 
with  success  or  failure.  A  reply  (if  required)  is  sent  to  the 
client  from  the  resource  manager  where  the  algorithm  stops. 

The routines "rmstart" and "rrstart" are invoked for RMSTART
and RRSTART requests, respectively. Each of these routines
computes the client's process identifier and terminal number in its
own way and then calls "rawstart". Another argument to "rawstart"
tells whether the load should have GENTLY or ROUGHLY mode.
An  RRSTART  message  received  at  the  client's  machine  is  recognized 
by  "rrstart";  if  the  mode  was  GENTLY,  "rawstart"  is  now  called 
with  ROUGHLY  mode;  if  the  mode  was  ROUGHLY,  a  negative  reply  is 
sent  to  the  client.  If  the  load  doesn't  succeed  locally  and 
there  are  no  other  machines,  "rawstart"  similarly  calls  itself 
with  mode  ROUGHLY  (if  the  mode  was  GENTLY),  or  sends  the  user  a 
negative  reply  (if  the  mode  was  ROUGHLY). 
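
The way a start request travels around the machines can be simulated in a few lines. This sketch is an interpretation of the description above, not the Roscoe code: the names "nextrm" and "ringstart" and the "room" table (which stands in for the outcome of the "load" and "startup" service calls) are assumptions.

```c
#define NMACHINES 5
enum { GENTLY, ROUGHLY };

/* Next higher machine number (modulo NMACHINES) that is known in "rmtab". */
int nextrm(int from, const int present[NMACHINES])
{
    for (int step = 1; step <= NMACHINES; step++) {
        int m = (from + step) % NMACHINES;
        if (present[m])
            return m;
    }
    return from;                        /* no colleague at all */
}

/*
 * Follow a start request that originates at "origin".  room[mode][m] is
 * nonzero if machine m could start the process in that mode.  The request
 * goes once around GENTLY; if it returns to the origin it goes around
 * once more ROUGHLY.  Returns the machine that starts the process, or -1.
 */
int ringstart(int origin, const int present[NMACHINES], int room[2][NMACHINES])
{
    for (int mode = GENTLY; mode <= ROUGHLY; mode++) {
        int m = origin;
        for (int visits = 0; visits < NMACHINES; visits++) {
            if (room[mode][m])
                return m;               /* the buck stops here, with success */
            m = nextrm(m, present);
            if (m == origin)
                break;                  /* request has come full circle */
        }
    }
    return -1;                          /* failed in both modes */
}
```

With a single machine, "nextrm" returns the origin itself, so the sketch falls through from GENTLY to ROUGHLY on that machine, matching the behavior described for "rawstart" when there are no other machines.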

The routine "getimage" loads a program. A "stat" checks that
the file is a publicly executable load-format file and computes
its date of last modification. If no acceptable copy already
resides locally, a new one is loaded. If "imagetab" is full, an
unused image is removed (in ROUGHLY mode). The procedure
"makeroom" is used to remove unused images. Unused images are also
removed (in ROUGHLY mode) until there is room for the new image
to be loaded. Images are removed in ascending order of their
index in "imagetab". When a new image is loaded, "imagetab" is
updated accordingly. Future developments should prevent loading if
a colleague has a useable copy, and removal of images should
perhaps use some other algorithm.
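
The slot-finding part of "makeroom" might look roughly like the sketch below. It covers only the image table, not the freeing of memory for the new image; the table size, the "inuse" field, and the way removal is represented are assumptions made for the example.

```c
#define NBRIMAGES 4      /* assumed table size */

struct image {
    int inuse;           /* hypothetical: nonzero if the slot holds an image */
    int count;           /* number of active processes using the image */
};

static struct image imagetab[NBRIMAGES];

/*
 * Return the index of an available slot in the image table.  If the
 * table is full, remove the first unused image (count == 0), scanning
 * in ascending index order.  Returns -1 if every image is in use.
 */
int makeroom(void)
{
    for (int i = 0; i < NBRIMAGES; i++)
        if (!imagetab[i].inuse)
            return i;
    for (int i = 0; i < NBRIMAGES; i++)
        if (imagetab[i].count == 0) {
            imagetab[i].inuse = 0;      /* stands in for the "remove" service call */
            return i;
        }
    return -1;
}
```

Scanning in ascending index order is the simple policy the text describes; a replacement policy based on load time or recent use would change only the second loop.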

The routine "newproc" starts a process. The new process is
given a link to the resource manager on channel 1; this link has
type REQUEST and TELLDEST, and its code is the new child's index
in "proctab". If the start succeeds, the corresponding "count"
field in "imagetab" is incremented, and an entry is made in
"proctab". If the start fails because there was no room for the
process's stack, then "makeroom" is called, as in the case of
"getimage" described above. The lowest three bits of the child's
process identifier tell the machine number; the variable
"uniquecode" generates unique process identifiers. If the child is
to be "DETACHED", its lifeline is destroyed.
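
One way to build identifiers with this property is sketched here. That the lowest three bits hold the machine number is stated in the text; placing a counter in the remaining bits is an assumption about how "uniquecode" might be used, and the helper names are hypothetical.

```c
#define MACHBITS 3
#define MACHMASK 07

static int uniquecode = 0;   /* assumed to be a simple counter */

/* Make a process identifier that is unique on this machine. */
int newprocid(int machine)
{
    uniquecode++;
    return (uniquecode << MACHBITS) | (machine & MACHMASK);
}

/* Recover the machine number from a process identifier. */
int procmachine(int procid)
{
    return procid & MACHMASK;
}
```

Any resource manager can therefore tell from an identifier alone which colleague hosts the process, without consulting a table.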




1.6 Process termination

The  "rmarg"  field  of  an  RMKILL  request  tells  the  process 
identifier  of  the  victim.  The  resource  manager  figures  out  which 
colleague hosts the victim by looking at the lowest three bits,
and  then  either  completes  the  request  itself  or  forwards  an 
RRKILL  request  to  the  appropriate  colleague.  The  RRKILL  request 
contains  the  process  identifier  of  the  client,  which  otherwise 
wouldn't  be  known  to  the  colleague  and  which  is  needed  to  check 
that  the  client  has  permission  to  kill  the  victim.  The  routine 
"rawkill"  checks  this  permission  and  then  performs  the  kill;  the 
victim's  lifeline  isn't  destroyed  (yet).  There's  nothing  to 
prevent the parent of a FOREGROUND process from performing a kill
if it correctly guesses the child's process identifier.

When  a  client  terminates,  naturally  or  otherwise,  the 
resource  manager  receives  a  DESTROYED  message  on  its  link  and 
calls  "procdie".  The  client's  entry  in  "proctab"  is  deleted  by 
setting  the  "proctype"  field  to  UNUSED,  except  for  FOREGROUND 
processes  (see  further  discussion  below).  The  corresponding 
"count"  field  in  "imagetab"  is  decremented.  The  process's  parent 
link  and/or  lifeline  are  destroyed  if  the  resource  manager  still 
holds  them. 

The termination of a colleague is similarly detected. The
routine  "rmdie"  updates  "rmtab"  accordingly. 


1.7 Terminal links

When a request for a terminal link is received over channel
1, the corresponding "ttypeno" field in "proctab" is compared to
"owntt"  to  see  if  the  local  terminal  link  is  desired.  If  not, 
the  routine  "gettt"  is  used  to  ask  the  appropriate  colleague  for 
a  copy  of  its  terminal  link.  RMTTREQ  requests  received  over 
channel  2  (i.e.,  from  a  colleague)  are  always  given  a  copy  of  the 
local  terminal  link. 


1.8 FOREGROUND processes

One  linked  list  of  FOREGROUND  processes  is  associated  with 
each  terminal;  at  most  one  terminal  is  owned  by  each  resource 
manager. The process that "has the ball" (will be killed by the
next  Control-C)  is  at  the  head  of  this  list,  and  it  points  to  the 
next  process  to  "get  the  ball".  Segments  of  this  list  reside 
physically  on  each  machine;  each  list  logically  threads  its  way 
among  several  machines. 

Two fields in a process table entry are relevant to this
discussion. The field "parentno" is the process identifier of the
process's parent; the last three bits of this number tell the
parent's machine number. The field "parentp" is the index in
"proctab" of the next local item in the FOREGROUND list. When a
process's successor is on the same machine, "parentp" points to
it, and the last three bits of "parentno" are the machine id;
when a process's successor is on another machine, the last three
bits of "parentno" tell which machine to go to next, and
"parentp" tells which local process comes next when the chain
returns to this machine. Each terminal chain in each resource
manager has a special header node containing two fields:
"foretop" gives the index of the first item in the process table
(i.e., it corresponds to a "parentp"), and "theball" is a Boolean
that tells whether this machine (i.e., the process indicated by
"foretop") "has the ball". A null pointer value for "parentp" or
"foretop" is indicated by -1.

When a resource manager receives an RMSTART request with
FOREGROUND mode, "rmstart" checks that the client "has the ball"
and turns off its "theball" flag. If the load doesn't succeed
locally, "rawstart" sends an RRSTART message as usual. Wherever
the load succeeds, the routine "newproc" will turn on the
"theball" flag, and insert the new process at the head of the
appropriate list. The routine "telltt" is used by "rawstart" to
send a TOKILL message to a terminal driver (local or not), so the
terminal driver will know which process now "has the ball". If
the start fails, the resource manager that initiated it notices
the request returning (perhaps for the second time); its
"theball" flag is turned back on by "rrstart", so the process that
previously "had the ball" still does.

When a FOREGROUND process dies, "procdie" marks the
corresponding "proctype" entry in "proctab" as DEFUNCT, rather
than UNUSED. The "parentno" and "parentp" fields are still
relevant, so the item is still linked up. These DEFUNCT items
are cleaned off as FOREGROUND processes die in the "proper"
sequence. Specifically, when the process "with the ball" dies,
"procdie" turns off the "theball" flag and calls "rrpass" to
clean off DEFUNCT processes at the head of the FOREGROUND list so
long as they point to other processes on the same machine. If a
non-DEFUNCT process is reached in this manner, it then "has the
ball"; the "theball" flag is turned on, and "telltt" is used to
send an appropriate TOKILL message to a terminal driver. On the
other hand, if the list points to another machine, an RRPASS
message is sent to the appropriate colleague. When a resource
manager receives an RRPASS message, it also uses "rrpass" to
clean off DEFUNCT processes and/or "pass the ball".
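
The cleaning loop in "rrpass" can be sketched as below. This is one reading of the description, with several assumptions: the table is reduced to the three fields that matter, there is a single terminal chain, "self" holds this machine's number, and the return value (the machine that should now have the ball, or -1 if the local chain is exhausted) is invented for the example. In particular, removing a DEFUNCT head before passing the ball to another machine is an interpretation, since the text does not say exactly when that entry is freed.

```c
#define NBRPROCS 8       /* assumed table size */
#define NIL      (-1)    /* null value for "parentp" and "foretop" */
#define MACHMASK 07

enum { UNUSED, FOREGROUND, DEFUNCT };

struct procnode { int proctype, parentp, parentno; };

static struct procnode proctab[NBRPROCS];
static int foretop = NIL;    /* head of the local part of the chain */
static int theball = 0;      /* does this machine have the ball? */
static int self = 2;         /* this machine's number (assumed) */

/*
 * Clean DEFUNCT entries off the head of the FOREGROUND list.  Returns
 * "self" if a live local process now has the ball, another machine
 * number if an RRPASS message should be sent there, or NIL if the
 * local chain is empty.
 */
int rrpass(void)
{
    while (foretop != NIL && proctab[foretop].proctype == DEFUNCT) {
        int nextmach = proctab[foretop].parentno & MACHMASK;

        proctab[foretop].proctype = UNUSED;     /* free the defunct entry */
        foretop = proctab[foretop].parentp;     /* next local item, if any */
        if (nextmach != self)
            return nextmach;                    /* chain leaves this machine */
    }
    if (foretop == NIL)
        return NIL;
    theball = 1;             /* a live process is at the head: it has the ball */
    return self;
}
```

A caller would follow a return value of "self" with a "telltt" call, and a return value naming another machine with an RRPASS message to that colleague.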

Files 

resource.u, resource.h

Data Structures

struct rmmesg {             /* messages to resource managers */
    int rmreq;              /* type of request */
    int rmarg;              /* various miscellaneous arguments */
    int rmmode;             /* the mode for STARTs or KILLs */
    long update;            /* time field used between R.M.'s */
    int parno;              /* parent's proc. id., used by R.M.'s */
    int ttno;               /* a terminal number, used by R.M.'s */
} rmmess;                   /* contents of outmess, used implicitly */

struct {                    /* image table entry */
    char fname[RMFNAMESZ];  /* file name */
    long loadtime;          /* time it was loaded */
    int count;              /* number of active processes */
    int procmode;           /* SHARE, REUSE, or VIRGIN */
    int imageno;            /* image, used for "start" or "remove" */
} imagetab[NBRIMAGES];      /* image table */

struct procnode {           /* process table entry */
    int proctype;           /* FOREGROUND, BACKGROUND, DETACHED,
                               UNUSED, or DEFUNCT */
    int parentp;            /* an index in this table */
    int parentno;           /* process identifier of the parent */
    int plink;              /* link supplied by parent during start */
    int location;           /* index into the image table */
    int lifeno;             /* lifeline, used for "kill" */
    int procid;             /* its process identifier */
    int ttypeno;            /* its terminal number */
} proctab[NBRPROCS];        /* known process table */

Procedures 

main(arg)

Initializes the resource manager for the machine given in "arg",
then executes a loop that receives and dispatches requests. The
argument also contains the bit NOTPAPA.

respond(n)

Sends a one-word response to the client. If it is a negative
error indicator, destroys the link that the client submitted.

giveaway(elink,how)

Returns the "elink" to the current client. "How" is either
"DUP" or "NODUP" to govern the disposition of that link.

int sendrm(n,elink,how)

Sends a message to another resource manager. "n" is the index
of the resource manager to send it to; "elink" and "how"
describe the link to enclose; "elink" = NOLINK means to
really not send a link. Returns 0 on success, -1 on
failure; sometimes the caller cares.

cryout(message) char *message;

General-purpose error indication routine.

gettt(n)

Asks the resource manager on machine n for its terminal
link. Returns either the link or -1 for error.

telltt(tt,elink)

Tells terminal on link "tt" that its new killlink is "elink".

loadfs()

Gets a file manager link from the original resource manager
and uses it to load a file manager on this machine.

loadtt()

Loads a terminal driver on this machine and gets an output
link to it.

loadci()

Loads a command interpreter on this machine; assumes there
is already a terminal driver.

initrm0()

Initialization specific to the original resource manager.
Finds the local time from Unix, loads the first file manager
via a manual load, loads a terminal driver, gets an input
line to ask for the configuration, then decodes the
configuration and loads the other machines.

initrms(arg)

Initialization specific to non-original resource managers.
Informs the original resource manager; either loads a local
file manager and terminal or uses links to the original
resource manager's copies.

rmstart()

Handles client request to start a new process. If FOREGROUND,
ensures the client currently has the ball. Calls "rawstart".

rrstart()

Handles request from another resource manager to do a start.
If the request has come full circle, either gives up or
tries roughly. Calls "rawstart".

rawstart(parent,ttype,how)

Tries to start a process on this machine. "Parent" gives
the procid of the client who initiated it all, "ttype" gives
its teletype number, "how" is GENTLY or ROUGHLY, but matters
only inside "getimage". Calls "newproc".

rmkill()

Handles client request to kill a process. Either directs
the request to the appropriate resource manager or calls
"rawkill".

rawkill(parent)

Checks if the parent has the right to submit this kill
request; if so, submits a "kill" service call.

rrlink()

The papa resource manager has requested a link for a third
party. The third party's number is in "contents->rmarg";
the papa's number is "inmess.urcode". Prepares a link and
sends it.

rrinform()

Handles a new link given by a colleague. Establishes
knowledge of that colleague in the proper tables. During
recovery actions, the new information may disagree with the
old.

rrpass()

This resource manager has just been given the ball. Cleans
off the defunct part of the foreground stack, and if it
becomes empty, sends the ball elsewhere.

rmdie()

Just found out that a colleague has died. Clears out
entries in "rmtab".

int getimage(name,mode,how) char *name;

Loads a new core image, and returns an index into imagetab.
If "how" = ROUGHLY, will also try to make room; otherwise
just hopes there is room, or a usable copy exists. The mode
is SHARE, REUSE, or VIRGIN. Returns -1 if the load fails.
Checks that the image is executable, and will not use an
existing image with obsolete date. Discovers if the image is
an Elmer program.

int newproc(imagno,arg,parent,type,ttype)

Makes a new process by using the service call "startup".
Returns an index in the updated proctab or -1 for error.
"Imagno" is the index in imagetab for the process's image,
"arg" is the argument to give the new process, "parent" is
the parent's procid, "type" is BACKGROUND, FOREGROUND, or
DETACHED, "ttype" tells which terminal to use. If the
process is in Elmer, opens its object file in order to give the
link to "startup".

procdie()

Cleans up after the termination of a client. If it was in
the foreground and had the ball, the ball is passed,
possibly to a colleague.

int getindex(i)

Returns the index in proctab for the process whose procid is
i.

int makeroom()

Makes room in the image table if possible, and returns the
index of an available slot. If necessary, core images of
terminated processes are removed. Returns -1 on failure.


drwrite(word)

Busy-waits until the DR-11 line to Unix is ready, then
writes one word.

int drread()

Busy-waits until the DR-11 line from Unix is ready, then
returns one word read.


2. THE TERMINAL DRIVER

2.1 General


The code lies in "ttdriver.u". Processes using the terminal
driver should include "ttdriver.h" and "filesys.h". (The latter
contains macros used by both the terminal driver and file
manager.) It isn't necessary to include these if all
communication is done by library routines.

The  terminal  driver  gives  its  parent  (usually  a  resource 
manager)  a  REQUEST  link  on  channel  1.  All  requests  to  open  the 
terminal  for  input  or  output  come  over  this  link  or  copies 
thereof;  also,  the  resource  manager  sends  "TOKILL"  messages  over 
this link. (See Section 2.8.) An "input link" is used for
"read" and "readline" requests from the client; an "output link"
is  used  for  "write"  requests.  Thus,  the  terminal  driver  accepts 
data  on  its  output  links  and  supplies  data  on  its  input  link.  At 
most  one  input  link  and  NUMCODES  (currently  5)  output  links  may 
be  open.  (Channel  2  is  used  for  output,  channel  3  for  input.) 
These  links  are  of  type  GIVEALL  but  not  DUPALL.  Destruction  of 
such  a  link  is  interpreted  as  a  "close"  command.  The  holder  of 
the  input  link  may  also  use  it  to  request  or  change  terminal 
modes. (See Section 2.4.)

Input and output are interrupt-driven; the routines "ttindriv"
and "ttoutdriv" are invoked at interrupt level when the
corresponding  devices  become  ready.  Initially,  interrupts  are 
enabled  and  "handler"  calls  set  up  the  interrupt  vectors.  The 
interrupt-level  routines  send  "awaken"  messages  to  the  terminal 
driver  on  channel  10.  Interrupts  are  disabled  at  crucial  times 
by turning off the appropriate interrupt-enable bits in the
device registers. (Since interrupt-level routines run at high
priority,  this  interrupt  disabling  is  not  strictly  necessary.) 
These  interrupt-driven  routines  share  data  with  the  rest  of  the 
terminal  driver. 

2.2 Overview of input

The Boolean variable "inuse" tells whether an input link is
open. The routine "readmsg" executes a "read" or "readline"
request by calling "getchar" for one character at a time. A
Boolean value is returned by "getchar" to tell if the character
terminates the current line; if so, the character returned is either
a newline, control-D, control-W, or null. The first three of
these are appropriately interpreted (depending on whether the
command is "read" or "readline"); the meaning of a null is
explained in Section 2.8. At most MSLEN characters may be read at
a time; thus the reply will fit in one message.

2.3 Overview of output

The integer arrays "codes" and "bytesleft", each of size
NUMCODES, keep track of output links. A zero entry in "codes"
indicates an unused link; otherwise links are given unique codes.
When a link is opened, the corresponding entry in "bytesleft" is
set to zero; when a write message is received, the indicated
length of the write is placed in "bytesleft". Subsequent
messages over this link are interpreted as data to be written until
the  write  is  completed.  The  routine  "writemsg"  is  used  to  write 
each  portion;  characters  are  written  with  the  routine  "sayfull" 
(Section  2.5),  except  for  carriage  returns  and  newlines,  which 
are  written  directly  by  calling  "sayit". 
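
The bookkeeping for output links can be sketched as follows. The arrays "codes" and "bytesleft" and the constant NUMCODES are from the text; the routine names, the counter used to hand out codes, and the convention that a message arriving while "bytesleft" is zero is a new write request are assumptions made for the example.

```c
#define NUMCODES 5

static int codes[NUMCODES];      /* 0 means the slot is unused */
static int bytesleft[NUMCODES];  /* bytes still expected for the current write */
static int nextcode = 1;         /* hypothetical source of unique codes */

/* Open an output link; returns its slot, or -1 if all are in use. */
int openout(void)
{
    for (int i = 0; i < NUMCODES; i++)
        if (codes[i] == 0) {
            codes[i] = nextcode++;
            bytesleft[i] = 0;
            return i;
        }
    return -1;
}

/*
 * A message arrived on slot i.  If no write is in progress, "len" is
 * the length announced by a write request; otherwise it is the number
 * of data bytes carried.  Returns 1 for a write request, 0 for data.
 */
int outmsg(int i, int len)
{
    if (bytesleft[i] == 0) {
        bytesleft[i] = len;
        return 1;
    }
    bytesleft[i] -= (len < bytesleft[i]) ? len : bytesleft[i];
    return 0;
}

/* Destruction of the link acts as a "close". */
void closeout(int i)
{
    codes[i] = 0;
}
```

Keeping the remaining length per link lets a single write span several messages without any extra framing in the data messages themselves.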


2.4 Requesting and changing console modes


The variable "modes" stores the current modes. When modes
are requested or changed, the routine "showstate" prints out the
"current" or "new" modes, respectively. The modes that can be
turned on and off are ECHO, HARD, UPPER, and TABS. A HARD
terminal cannot backspace its cursor legibly, an UPPER terminal
cannot enter lower case directly, and a TABS terminal has hardware
tabs.

2.5 Output buffer manipulation


The terminal driver uses a circular output buffer,
"ttoutbuf", of size TTOUTBUFSIZE (150). Two variables are used as
indices into "ttoutbuf": "nxtoutch", which points to the next
available place to put a character into the buffer, and
"outbufp", which points to the next character to take out. When
these indices are equal, the buffer is empty.

The routine "ttoutdriv" is called at interrupt level to write
a character. This routine returns without any action if "paused"
is true (Section 2.7). If "ttoutbuf" is nonempty, the next
character is written to the terminal, with a delay specified in
absolute location 157702 (for debugging and connection to a slow Unix
port). After it is displayed, each newline is replaced in the
buffer by a carriage return rather than being removed; by this
device, carriage returns are effectively appended to newlines.
Input interrupts are disabled while messing with the buffer (in
case ECHO mode is on).

The routine "sayit" puts one character into "ttoutbuf". If
doing so would fill the buffer, "sayit" waits a second and tries
again. The variables "outpos" and "tabplptr" are significant for
echoing input; "sayit" sets them both to zero after a carriage
return or newline. In other cases "outpos" is incremented,
except that after a backspace it is decremented. While the buffer
is being changed, input and output interrupts are disabled (we
might be in ECHO mode).
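
The buffer discipline described in this section can be sketched with two small routines. The names "ttoutbuf", "nxtoutch", "outbufp", and TTOUTBUFSIZE are from the text; "tryput" and "take" are hypothetical stand-ins for the buffer-handling parts of "sayit" and "ttoutdriv", and the sketch leaves out interrupt masking, the one-second retry, and the newline-to-carriage-return replacement.

```c
#define TTOUTBUFSIZE 150

static char ttoutbuf[TTOUTBUFSIZE];
static int nxtoutch = 0;     /* next available place to put a character */
static int outbufp = 0;      /* next character to take out */

/* The buffer is empty when the two indices are equal. */
int outempty(void)
{
    return nxtoutch == outbufp;
}

/* Put one character; returns -1 if doing so would fill the buffer. */
int tryput(char c)
{
    int next = (nxtoutch + 1) % TTOUTBUFSIZE;

    if (next == outbufp)
        return -1;           /* the caller would wait and try again */
    ttoutbuf[nxtoutch] = c;
    nxtoutch = next;
    return 0;
}

/* Take one character out; returns -1 if the buffer is empty. */
int take(void)
{
    int c;

    if (outempty())
        return -1;
    c = (unsigned char)ttoutbuf[outbufp];
    outbufp = (outbufp + 1) % TTOUTBUFSIZE;
    return c;
}
```

Because equal indices mean "empty", one slot is always left unused, so the buffer holds at most TTOUTBUFSIZE - 1 characters.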

The routine "sayfull" converts a character into a readable
(or audible) form and calls "sayit". If UPPER mode is on, a "!"
is placed before appropriate characters. Except for control-G
(bell), control characters are converted to the notation "^A",
etc. A rubout is converted to "^#". The array "tabplace" is
used to store the cursor position just before each tab, to allow
backspacing over tabs; "tabplptr" is an index in "tabplace". At
most TABNUM (10) tabs are stored. If TABS mode is off, a tab is
converted into several blanks, until "outpos" becomes a multiple
of 9 ("sayit" increments "outpos"). If TABS mode is on, the tab
character is sent directly to "sayit", and "outpos" is adjusted
accordingly.

The routine "sayback" is used when the console is in ECHO but
not HARD mode. It converts a given character into several
backspaces, to undo the effect of "sayfull", and calls "sayit". Two
backspaces are required for control characters, (escaped) rubout,
and the UPPER mode escape sequences, except that none are needed
for bells. Other characters take one backspace, except for tabs,
which require a sequence of backspaces until "outpos" has been
decremented (by "sayit") to the appropriate value found in
"tabplace". Also, "tabplptr" is decremented.

2.6  Input  buffers 

The terminal driver uses a circular input buffer, "ttinbuf",
of size TTINBUFSIZE (100), and a circular buffer, "lineptr", of
size TTLINES (20). The entries in "lineptr" are indices in
"ttinbuf" that tell where lines begin; thus TTLINES is the
maximum number of lookahead lines. There are two other variables
used as "ttinbuf" indices: "inbufp", which tells the next
available spot to put a character into "ttinbuf", and "nxtchar", which
tells the next character to take from the buffer. When "nxtchar"
is one buffer location ahead of "inbufp", the buffer is full. To
permit intra-line editing, lines can only be removed from the
buffer when they have been "terminated". There are two variables
used as "lineptr" indices: "lastline", which tells the current
line being put into the buffer, and "nxtline", which tells the
line being removed. The buffer is empty when "lastline" and
"nxtline" are equal.

The  routine  "ttindriv"  is  called  at  interrupt  level  to  read  a 
character.  If  "paused"  is  true,  then  only  control-Q  and 
control-S  will  have  any  effect;  all  other  characters  are  ignored 
(Section  2.7).  The  Boolean  variable  "escaping"  is  true  when  the 
next  character  is  to  have  no  special  meaning;  it  becomes  true 
when  the  escape  character  is  read  and  becomes  false  after  the 
following  character.  A  character  with  no  special  meaning  is 
placed  in  the  buffer  by  calling  "putinbuf";  if  the  buffer  is 
full,  the  character  is  discarded  and  a  bell  is  written  by  calling 
"sayit".  If  there  is  room  and  ECHO  mode  is  on,  the  character  is 
written by calling "sayfull"; for example, an escaped newline
will echo as "^J", which allows backspacing over it later. If
UPPER  mode  is  on,  appropriate  translation  takes  place. 

Various characters cause intra-line editing. A control-C is
echoed as

^C<bell><newline>

and isn't placed in the input buffer; see Section 2.3. An ERASE
character  (rubout)  causes  the  last  character  of  the  present  line 
to  be  removed  from  the  buffer,  using  the  routine  "tkoutbuf".  If 
the  line  wasn't  empty  and  ECHO  mode  is  on,  the  character  removed 
from the buffer is echoed, using "sayfull" in HARD mode and
"sayback" otherwise. In HARD mode, backslashes ('\') are placed
around a sequence of erased characters; when erasing begins, a
backslash is echoed and the Boolean variable "erasing" becomes
true; when erasing ends, a backslash is echoed and "erasing"
becomes false. A KILL character (control-X) removes the entire
present  line  from  the  buffer.  In  ECHO  mode,  "??"  is  printed.  In 
HARD  mode,  a  newline  is  also  printed.  If  we  are  in  ECHO  but  not 
HARD  mode,  the  KILL  is  treated  as  a  sequence  of  ERASES,  until  the 
current  line  is  empty;  thus  the  screen  cursor  returns  to  the 
point  where  the  line  began. 

A line is "terminated" by an (unescaped) control-D, control-W,
carriage return, or newline. In the first two cases, the
character is put in the buffer but not echoed. In the last two
cases, a newline is put in the buffer and echoed. The routine
"termline" updates "lineptr"; if this buffer is full, the line
termination is ignored. Whenever a line is terminated, the
variable "linecount" is reset to 0. This variable keeps track of how
many lines have been output to the terminal since the last time
the user entered a line. An "awaken" call is made in case
"getchar" (described next) was waiting for the buffer to become
nonempty. If this awaken is received in the terminal driver's
main loop, it is properly ignored.

The routine "getchar" takes a character out of "ttinbuf" and
also returns a Boolean value to tell if a line has just been
completed. (Thus, for example, escaped and unescaped newlines are
distinguished.) When a line is completed, "lineptr" is
appropriately updated. If the buffer was empty, "getchar" waits
for a message on channel 10 to indicate that the interrupt level
routine "ttindriv" has put a line into the buffer. (This message
might also indicate a control-C; see Section 2.9.) Input
interrupts are disabled while the buffer is in an inconsistent
(partially updated) state.

2.7  Pause  control 

Pause control uses the commands control-S and control-Q.
Initially, "scroll" is false. If a control-S is typed in this
state, "scroll" is set to true and "paused" is also set to true.
The effect is that output pauses until released, and it will
continue to pause every SCROLLLEN (13) lines. When the
terminal is paused, a control-S will cause it to be released for
the next SCROLLLEN lines, but a control-Q will release it and turn
off scroll mode, so it will not stop again. Control-Q can also be
used to turn off scroll mode even if the terminal is not
currently paused.
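
The pause rules above can be read as a small state machine. The
following is only a sketch under stated assumptions, not the
driver's actual code: the structure "pause_state" and the routines
"on_control_s", "on_control_q", and "on_line_output" are
hypothetical names, and the treatment of a control-S typed while
scroll mode is on but output is not paused (pause at once) is an
assumption, since the text does not cover that case.

```c
#define SCROLLLEN 13  /* value as printed in the text; hypothetical here */

struct pause_state {
    int scroll;     /* nonzero while scroll mode is on */
    int paused;     /* nonzero while output is being held */
    int linecount;  /* lines output since the last release */
};

/* control-S: release a paused terminal for one more page,
 * otherwise turn scroll mode on and pause immediately. */
void on_control_s(struct pause_state *p)
{
    if (p->paused) {
        p->paused = 0;
        p->linecount = 0;
    } else {
        p->scroll = 1;
        p->paused = 1;
    }
}

/* control-Q: release output and leave scroll mode, paused or not. */
void on_control_q(struct pause_state *p)
{
    p->scroll = 0;
    p->paused = 0;
    p->linecount = 0;
}

/* Called after each output line; pauses once a full page has gone by. */
void on_line_output(struct pause_state *p)
{
    if (p->scroll && ++p->linecount >= SCROLLLEN)
        p->paused = 1;
}
```

In this sketch a caller would start from an all-zero "pause_state"
and consult "paused" before sending each character.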

2.8 Control-C actions

The variable "killlink" contains the lifeline along which a
kill should be performed upon receipt of a control-C. Initially,
"killlink" is set to -1 to indicate the absence of such a
lifeline. The resource manager encloses such a lifeline in a
"TOKILL" message to the terminal driver, received on channel 1.
When a lifeline is received, the Boolean variable "ctrlC" is set
to false.

The  interrupt  level  routine  "ttindriv"  notices  when  a 
control-C  is  typed.  If  "ctrlC"  was  false  and  "killlink"  was 
non-negative,  "ctrlC"  is  set  to  true  and  a  message  is  sent  to  the 
terminal  driver  on  channel  10  with  an  "awaken"  call.  While 
"ctrlC"  is  true,  all  messages  received  on  channels  2  and  3  are 
ignored  and  any  links  enclosed  in  such  messages  are  destroyed. 

The  message  on  channel  10  is  received  either  in  the  terminal 
driver's  main  loop  or  in  the  routine  "getchar",  which  was  waiting 
for  a  non-empty  input  buffer.  In  either  case,  the  routine 
"chkctrlC" performs the kill if "ctrlC" is true, flushes all
outstanding messages on channel 10, and reinitializes "killlink" and
the  input  and  output  buffers.  Any  lame-duck  messages  on  channels 
2  and  3  will  be  ignored  because  "ctrlC"  is  still  true.  A  Boolean 
value  is  returned  by  "chkctrlC"  so  that  "getchar"  knows  that  the 
message  was  a  control-C  indicator  rather  than  a  non-empty  buffer 
indicator. 

After  a  kill  is  performed,  control  must  return  to  the  main 
loop.  If  the  message  had  been  received  inside  "getchar",  the 
value returned is the null character as a line terminator. This
special  case  is  recognized  by  "readmsg"  which  then  returns  to  the 
main  loop  without  completing  its  read  request;  the  link  enclosed 
with  that  request  is  destroyed. 

2.9  Pausing  and  continuing 


When  an  unescaped  control-S  is  read,  the  variable  "paused"  is 
set  to  true.  When  an  unescaped  control-Q  is  read,  that  variable 
is  reset  to  false  and  the  output  interrupt  enabling  is  toggled  to 
restart  the  output-interrupt  driven  routine  "ttoutdriv".  That 
routine  returns  without  any  action  if  "paused"  is  true. 

Files 

ttdriver.u,  ttdriver.h,  filesys.h 

Data  Structures 

char ttinbuf[TTINBUFSIZE]

Circular input buffer filled by ttindriv, emptied by
readmsg.

int lineptr[TTLINES]

Circular buffer of ttinbuf indices that point to beginnings
of lines.

char escaping

Boolean; set by ESCAPE, reset by next character.

char erasing

Boolean; true during a sequence of ERASES; only used in hard
copy mode.

char modes

Bits used: ECHO, TABS, HARD, UPPER.

int tabplace[TABNUM], tabplptr

Remembers where tabs were.

char ttoutbuf[TTOUTBUFSIZE]

Circular output buffer. Filled by sayit, emptied by
ttoutdriv.

int codes[NUMCODES]

Currently active output lines.

int bytesleft[NUMCODES]

Used to keep track of pieces of different write messages.

int killlink

Tells the ttdriver whom to kill on ^C.

Procedures 

main(dev)

Initializes tables, provides parent with a request link,
prepares to use terminal whose device register is at "dev".
Executes a loop that receives and dispatches client
requests.

readmsg(len, how)

Reads "len" characters, using routine "getchar". At most
MSLEN characters are read. Reading terminates if a
control-D is read. In the case that "how" is READLINE, any
line terminator (control-W or <cr> or <lf>) terminates
reading.

char getchar(ch) char *ch;

Gets a character from the input buffer and returns it in
"ch". The returned value is Boolean: TRUE means the
character returned ends a line.

ttindriv()

Called by "ttinint" when a character is ready; runs at
interrupt level. Reading a control-C causes an "awaken"
service call. ERASE or KILL causes intra-line editing. Line
termination is caused by control-D, control-W, <cr>
(converted into <lf>) and <lf>. Termination causes an "awaken"
service call. The input buffer is updated, and the input is
properly echoed.

char putinbuf(ch) char ch;

Puts the given character into the input buffer. It returns
TRUE only if there was room in the buffer.

char tkoutbuf(ch) char *ch;

Removes last character of current line from buffer. Returns
TRUE only if something was there. The character is returned
in "ch".

termline()

Called at interrupt level to cause an "awaken" and to reset
buffer pointers.

resetbuf()

Removes the current line by resetting an input buffer
pointer.

writemsg()

Decodes a write message from a client. If it is the header
of several packets with data, variables are initialized to
receive the data. If data have arrived, they are placed in
the output buffer by "sayit" and "sayfull".

char chkctrlC()

If a control-C has been received, all awaken messages are
flushed, a service call "kill" is performed along the
killlink, and the routine returns "TRUE".

closeinput()

Reduces the count of input lines in use.

closeoutput()

Resets the appropriate output line information.

openline(how)

Handles a client request for a new input or output line, as
described by "how". Appropriate variables are initialized.

reply(retcode, size) char retcode;

Reports "retcode" to the current client. The "size"
parameter tells how much of the standard message buffer has been
filled with other useful information that the client must
also receive. The reply code is put in the first byte.

showstate(when) char *when;

Prints the current modes on the terminal with an
introductory message determined by "when".

sayit(ch) char ch;

Puts one character in the output buffer and adjusts position
variable "outpos" accordingly. This routine is used both for
input and output echoing.

sayfull(ch) char ch;

Uses "sayit" to provide a readable form for any character
according to the current modes.

sayback(ch) char ch;

Prints as many backspaces as necessary to obliterate the
full printing of character "ch" under current modes.

ttoutdriv()

Called at interrupt level. Waits a standard delay to slow
down the terminal and then sends one character from the
output buffer to the terminal. Line feeds are followed by
carriage returns. Returns with no action if "paused" is true.

ttyflush()

Removes any character waiting in the terminal input buffer.

inflush()

Clears out the entire input buffer.

outflush()

Clears out the entire output buffer.


3.  THE  COMMAND  INTERPRETER 


3.1  General 


The  command  interpreter  is  a  FOREGROUND  process  that  executes 
commands  typed  at  its  console.  The  command  interpreter  may  start 
another  FOREGROUND  process,  which  communicates  with  the  command 
interpreter  to  get  command-line  arguments. 

The command interpreter is compiled by executing "makecomint",
which compiles and links together three files to produce
"comint". The three source files are: "comint.u", which handles
command line parsing, "comutil.u", which contains routines to
execute most commands, and "comrun.u", which executes the "run"
command.


3.2  Initialization 


The command interpreter acquires a file manager link, a
terminal driver link, and terminal input and output links from the
resource manager. The terminal driver link is only used to
request or change console modes; initially, the command interpreter
sets these to "ECHO".

3.3 Command line parsing

A line is input with a "readline" call and converted into a
null-terminated string. The line is truncated to LINEMAX-1
characters (LINEMAX is 200).

The routine "findargs" scans the input line, separating it
into arguments. Sequences of characters enclosed in quotes are
left alone, with the quotes deleted. The Boolean variable
"quoted" is true during this process. Two consecutive quotes
encountered while "quoted" is true are converted into one quote and
do not turn off "quoted". When "quoted" is false, a blank or tab is
converted into a null to terminate an argument. Any immediately
following blanks or tabs are ignored; the Boolean variable
"spacing" is true during this process. At the beginning of the line,
"quoted" is false and "spacing" is true. The array "argvec"
returns pointers to the argument locations; an entry is made in
"argvec" when "spacing" changes from true to false. The variable
"argcount" tells the number of arguments; it is incremented when
"spacing" changes from false to true or at the end of the line if
"spacing" is false.

If there are no arguments, no action is taken. Only the
first MAXARGS (10) arguments are used; the rest are ignored.
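
The scanning rules for "findargs" can be sketched in C as follows.
This is a reconstruction from the description above, not the
original source. It assumes that the line is rewritten in place
(so that deleted quotes leave no gaps) and that "argcount" is
clamped to MAXARGS at the end; both choices are assumptions.

```c
#define MAXARGS 10

static char *argvec[MAXARGS];  /* pointers to the starts of arguments */
static int argcount;           /* number of arguments found */

/* Splits "line" in place into null-terminated arguments. */
int findargs(char *line)
{
    char *src = line;   /* next character to examine */
    char *dst = line;   /* next position to write (never ahead of src) */
    int quoted = 0;     /* true inside a quoted sequence */
    int spacing = 1;    /* true while skipping blanks between arguments */

    argcount = 0;
    while (*src != '\0') {
        char c = *src++;
        if (quoted) {
            if (c != '"') {
                *dst++ = c;            /* quoted text is left alone */
            } else if (*src == '"') {
                *dst++ = '"';          /* two quotes become one */
                src++;
            } else {
                quoted = 0;            /* closing quote is deleted */
            }
        } else if (c == ' ' || c == '\t') {
            if (!spacing) {            /* first blank ends the argument */
                *dst++ = '\0';
                spacing = 1;
                argcount++;
            }
        } else {
            if (spacing) {             /* spacing: true -> false */
                spacing = 0;
                if (argcount < MAXARGS)
                    argvec[argcount] = dst;
            }
            if (c == '"')
                quoted = 1;            /* opening quote is deleted */
            else
                *dst++ = c;
        }
    }
    if (!spacing) {                    /* end of line inside an argument */
        *dst = '\0';
        argcount++;
    }
    if (argcount > MAXARGS)
        argcount = MAXARGS;            /* extra arguments are ignored */
    return argcount;
}
```

Because the write position never passes the read position, the
arguments can share the storage of the original line, which is
consistent with "argvec" holding pointers into it.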

The  routine  "lookup"  searches  a  list  of  character  strings  to 
find  those  whose  initial  segments  match  a  given  string  argument. 
The  list  format  is  an  alphabetically  sorted  array  of  character 
strings  alternating  with  corresponding  codes  (integers),  and  with 
pseudodata  sentinels  at  each  end.  Two  pointers  into  the  table, 
"low"  and  "high",  start  at  opposite  ends  and  move  toward  each 
other as the argument is scanned. As each character in the
argument is examined, "low" moves up the table so long as this
character is larger than the corresponding ones in the table at which
"low" points; "high" does the reverse. The process stops if the
argument is exhausted or if "high" and "low" pass each other. In
the latter case, there is no match. In the former, there are one
or more matches: exactly one if "low" and "high" are equal,
several otherwise.
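
The narrowing search performed by "lookup" can be illustrated with
the sketch below. It is not the original routine: it uses an array
of structures in place of the alternating string/code layout, and
explicit bounds checks in place of the sentinel entries, and the
names "struct entry" and "prefix_lookup" are hypothetical.

```c
#include <stddef.h>

struct entry {
    const char *name;  /* command or mode name */
    int code;          /* internal distinguishing code */
};

/* Finds all entries of the sorted "table" (n entries) that have "str"
 * as an initial segment. Stores the index range in *first and *last
 * and returns the number of matches (0 if the pointers crossed). */
int prefix_lookup(const char *str, const struct entry *table, int n,
                  int *first, int *last)
{
    int low = 0, high = n - 1;
    size_t i;

    /* Entries in [low, high] always agree with str on the first i
     * characters, so indexing name[i] stays within each string. */
    for (i = 0; str[i] != '\0' && low <= high; i++) {
        unsigned char c = (unsigned char)str[i];
        while (low <= high && (unsigned char)table[low].name[i] < c)
            low++;                     /* argument character is larger */
        while (low <= high && (unsigned char)table[high].name[i] > c)
            high--;                    /* argument character is smaller */
    }
    *first = low;
    *last = high;
    return low <= high ? high - low + 1 : 0;
}
```

With a table holding "time" and "type", the argument "t" leaves two
candidates, while "ti" narrows the range to a single entry, which is
the unique-match case the command interpreter acts on.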

The  first  argument  on  the  command  line  is  deciphered  as  a 
command  by  calling  "lookup"  with  the  table  "commands".  If  there 
is  a  unique  match,  the  appropriate  action  is  taken. 

3.4  Command  execution 

The "background" command starts a process with modes
"BACKGROUND" and "REUSE" and passes the given argument as an integer.
An  answer  is  received  from  the  resource  manager  and  the  new 
process's  process  identifier  is  printed.  The  new  process  is  not 
given  a  link  to  the  command  interpreter. 

The "copy" and "type" commands are translated into "run
copyfile" commands. The program "copyfile" is an independent program
that copies one file to another, with the terminal as the default
for the second file.

The  "directory"  command  executes  a  "stat"  on  the  indicated 
file  and  prints  selected  portions  of  the  information  returned. 

The "make" command "creates" the indicated file and performs
"read" commands for IOBUFSIZE (100) bytes at a time from the
terminal. After each "read", a "write" is done to the file;
finally, the file is closed. The end is indicated by a "read"
returning fewer than 512 bytes; thus, if the input has exactly 512
bytes (or a multiple thereof), it must be terminated by an extra
control-D.

The  "run"  command  starts  a  process  with  modes  "FOREGROUND" 
and  "REUSE"  and  passes  as  an  argument  the  number  of  command  line 
arguments.  The  resource  manager  is  given  a  REQUEST  link  for  the 
child  and  the  terminal  input  link  is  closed  so  that  the  child  may 
open it. Command line arguments are sent to the child when
requested. The command interpreter assumes that the child has
terminated when the REQUEST link is destroyed; it then reads the
next console command. If the start fails, the command
interpreter waits for the REQUEST link to be destroyed before
continuing.

The  "set"  or  "SET"  command  first  requests  the  current  modes, 
which  causes  the  terminal  driver  to  print  them.  The  command  line 
arguments  are  then  deciphered  individually;  a  "-"  prefix  is 
remembered  with  the  Boolean  variable  "notflag"  and  the  command 
itself  is  decoded  by  calling  "lookup"  (Section  3.3)  with  the 
table "modetab". When a mode is recognized, the current mode
specification is altered accordingly. Finally, the modes are
changed, and the terminal driver prints the new modes.

The  "time"  command,  if  given  an  argument,  sets  the  time  by 
calling  "datetol"  and  "setdate".  The  argument  is  only  checked  to 
see  that  it  has  ten  characters,  and  zeroes  are  added  for  the 
number  of  seconds.  With  or  without  an  argument,  "time"  finally 
prints the current time, which is done by calling "date" and
"ltodate".

Files 

comint.u, comutil.u, comrun.u, comint.h

Data  Structures 

char *commands[]

Holds the known commands paired with an internal
distinguishing code. The array must be in alphabetical order.

char *argvec[MAXARGS]

The arguments to started processes are stored here.

char *modetab[]

A table of terminal mode names to be used with the routine
"lookup".

Procedures 

main()

Initializes tables, acquires file manager and terminal
driver links, then executes a loop that accepts commands
from the terminal and dispatches them.

int findargs(line) char *line;

"Line" is null-terminated (without final newline) and
doesn't contain any embedded nulls. Puts pointers to the
beginnings of arguments into the array "argvec" and the
count of how many were found into the global "argcount".
Terminates the arguments with nulls. Spaces and tabs are
considered delimiters unless they appear in quotes ("). Two
consecutive quotes inside quotes are considered one quote.
Other quotes are stripped.

lookup(str, table, tablesize, result) char *str, **table; int
*result;

Looks up character string "str" in "table". Sets result[0]
and result[1] such that table[result[0]],
table[result[0]+1], ..., table[result[1]] all have "str" as
an initial segment. If result[0] > result[1], there was no
match. Assumes that table[0], table[2], ... are strings
kept sorted in alphabetical order, and table[1], table[3],
... are other data to be ignored in lookup. Assumes further
that table[0] is guaranteed to compare low and
table[tablesize-2] is guaranteed to compare high with "str".


int intype(fname) char *fname;

Handles a "make" command. Accepts input from the terminal,
creates a new file with name "fname" and puts all input on
that file. Returns 0 on success, -1 on failure.

dir(fname) char *fname;

Handles a "dir" command. Uses the file manager to read the
directory information from a file, and prints it on the
terminal.

setmodes()

Handles a "set" command. Uses "lookup" to find what modes
are requested, and communicates with the terminal driver to
establish those modes.

printtime()

Handles a "time" command. Finds the current time with the
service call "date" and the library routine "ltodate", then
prints the result.

settime(s) char *s;

Handles a "time" command with an argument. Uses the library
routine "datetol" and the service call "setdate" to change
the kernel's date.

int runback(fname, arg) char *fname;

Attempts to run the file "fname" as a background process,
handling the "back" command. It returns the process id of
the new process or -1 on failure.

killback(procid)

Sends a note to the resource manager to kill the process
whose identifier is "procid". Handles the "kill" command.

int run(fname, argc, arg0) char *fname;

Handles the "run" command. Attempts to load and run the
executable file named by "fname". Returns 0 on success. If
"argc" > 0 then uses "arg0" and following arguments to
satisfy requests for arguments instead of arguments from the
command line. Executes a loop that waits for requests from
the child for arguments until the child terminates.


4.  THE  FILE  MANAGER 


4.1  General 


The  file  manager  forwards  requests  from  other  processes  to 
the  demon  running  on  the  PDP-11/40  where  they  are  implemented 
under  Unix.  (See  Section  5  for  details  on  the  demon.) 

The code lies in "filesys.u"; all processes using the file
manager should include "filesys.h". (It isn't necessary to
include "filesys.h" if all communication is done by library
routines.)

The  word-parallel  line  used  to  communicate  with  the  PDP-11/40 
uses  three  registers  at  location  DR11.40  (octal  167770).  All 
reading  and  writing  use  busy  waits.  More  details  are  found  in 
the  file  "io.h". 

The  file  manager  initially  gives  a  REQUEST  link  to  its  parent 
(usually  the  resource  manager)  with  channel  1.  All  "open", 
"create",  "alias",  "unlink",  and  "stat"  requests  come  over  this 
link or copies thereof. When a file is "opened" or "created", a
new link with channel 2 is enclosed with the reply. This link
will be used for "read", "readline", "write", and "seek"
requests; its destruction indicates a "close" request.

4.2 Execution of requests

The file manager executes most requests by receiving a
message from the client, writing a request over the word-parallel
line  to  the  PDP-11/40,  reading  the  reply  from  the  word-parallel 
line,  and  sending  it  to  the  client. 

Requests on channel 1 contain file names. These are
communicated over the word-parallel line by first writing the length and
then the name. The routine "rawopcr" is used by "open",
"create", "alias", and "unlink" to send a request to the demon
and receive a one-word reply to be forwarded to the client. In
the cases of successful "open" or "create" calls, a new link with
channel 2 is enclosed with the reply; the code for this link is
the same as the value returned to the client (a Unix file
descriptor). This new link is of type GIVEALL but not DUPALL.
The routine "rawstat" is used by "stat"; it reads 38 bytes from
the demon. The first word tells whether the stat was successful;
either 0 or 36 bytes are forwarded to the client accordingly.

A  request  on  channel  2  refers  to  an  open  file;  the  file 
descriptor  for  this  file  is  the  code  of  the  link.  The  routine 
"rawread"  forwards  a  "read"  or  "readline"  request  to  the  demon 
and reads the reply. An integer telling how many bytes were
actually read comes first, followed by the bytes themselves. The
bytes read are then forwarded to the client. No more than MSLEN
bytes should be read at a time, so that one message suffices for
the reply. The routine "rawseek" similarly treats "seek"
requests, except that only one word is read from the demon, and
then  forwarded  to  the  client. 

When  a  file  is  opened  for  writing,  the  corresponding  entry  in 
the  array  "bytesleft"  is  set  to  zero.  ("Bytesleft"  is  indexed  by 
file  descriptors.)  When  a  "write"  request  is  received  on  channel 
2, the indicated length for the write is inserted in "bytesleft".
Further messages on the same link are taken as data to be written
(MSLEN bytes at a time) until the write is completed. As each
portion is received, the routine "rawwrite" sends it to the demon
and waits for an acknowledgment before proceeding. No reply is
given to the client.
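
The bookkeeping with "bytesleft" can be sketched as below. This is
an illustration only: the values given for MSLEN and NFILES and the
helper names "classify", "begin_write", and "accept_data" are
hypothetical, and the real file manager forwards each portion to
the demon through "rawwrite" rather than merely counting it.

```c
#define MSLEN  40   /* hypothetical message size; the text gives no value */
#define NFILES 16   /* hypothetical limit on open file descriptors */

static int bytesleft[NFILES];  /* indexed by file descriptor */

enum msg_kind { MSG_REQUEST, MSG_DATA };

/* A message on an open file's link is data while a write is pending. */
enum msg_kind classify(int fd)
{
    return bytesleft[fd] > 0 ? MSG_DATA : MSG_REQUEST;
}

/* A "write" request announces how many bytes will follow. */
void begin_write(int fd, int length)
{
    bytesleft[fd] = length;
}

/* Accounts for one data message; returns the number of bytes that
 * would be passed on to the demon for this portion. */
int accept_data(int fd, int msglen)
{
    int n = msglen;

    if (n > MSLEN)
        n = MSLEN;             /* at most one message's worth */
    if (n > bytesleft[fd])
        n = bytesleft[fd];     /* the last portion may be short */
    bytesleft[fd] -= n;
    return n;
}
```

Under these assumptions a 100-byte write arrives as portions of 40,
40, and 20 bytes, after which the link again carries ordinary
requests.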

When  a  link  on  channel  2  is  destroyed,  a  "close"  message  is 
sent  to  the  demon.  No  response  is  read  from  the  demon,  and  no 
reply  is  made  to  the  client. 


Files 


filesys.h, filesys.u, io.h

Procedures

main()

Initializes tables, then executes a loop awaiting client
requests and dispatching them.

rawstat(name, replylink) char *name;

Handles a "stat" request. The file "name" is sent to the
demon. Its answer is returned to the client; failure is
marked by an empty message.

int rawopcr(file, mode, how) char *file;

The argument "how" is OPEN, ALIAS, UNLINK, or CREAT. A
message is sent to the demon to do the appropriate action to
"file". The "mode" is the same as Unix mode for files. The
file descriptor given by the demon is returned.

rawread(rwfd, buf, bytes) char *buf;

Reads "bytes" bytes from the file whose descriptor is "rwfd"
into "buf", which must be on a word boundary (even).

readdr(buf, rwlen) int *buf;

Reads ceiling(rwlen/2) words from the DR line to Unix into
"buf", which must be word-aligned (even).

writedr(buf, rwlen) int *buf;

Writes ceiling(rwlen/2) words to the DR line to Unix from
"buf", which must be word-aligned (even).

rawclose(usercode)

Handles a client "close" request. Sends a note to the demon
to close the file whose descriptor is "usercode".

rawwrite(rwfd, buf, bytes) char *buf;

Handles a "write" request from a client. Gives the demon
data from "buf" of length "bytes" to be placed in file
"rwfd".

int rawseek(skfd, offset, mode)

Handles a "seek" request from a client. Sends a note to the
demon to do the given seek ("offset" and "mode" mean the
same as in Unix) to file identified as "skfd". Success
returns 0; failure -1.


5.  THE  DEMON 


5.1 General

The "demon" is a program that runs on the PDP-11/40 under
Unix. Its code is in "demon.c". Roscoe processes that
communicate with it must include "demon.h".

For  each  LSI  there  is  an  associated  demon.  This  demon  reads 
from  a  word-parallel  line  connected  to  that  LSI;  the  Unix  names 
for these lines are "/dev/drx", where x = 0,...,4. Commands are
translated into corresponding Unix system calls and appropriate
responses are written to the word-parallel line. All user
process communication at the LSI side of the fence is done by the
file  manager  (Section  4). 

Each  message  sent  in  either  direction  on  the  word-parallel 
line  is  preceded  by  at  least  one  header  word  of  "NONSENSE"  (octal 
125252).  After  the  header  word(s),  the  next  three  words  of  a 
message  to  the  demon  have  the  following  structure: 

struct {
    int command, code, length;
}

The  number  of  bytes  remaining  in  the  message  is  "length".  These 
remaining  bytes  are  usually  a  file  name,  in  which  case  they  will 
subsequently  be  read  into  the  character  array  "fname",  of  size 
MAXNAME  (40).  The  value  returned  over  the  word-parallel  line  is 
usually  a  single  word  (after  a  word  of  "NONSENSE"). 
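
The framing just described can be sketched as a parser over a
buffer of 16-bit words. This is a sketch under stated assumptions,
not the demon's code: the names "struct request", "parse_request",
and "body_words" are hypothetical, and the sketch takes every
leading NONSENSE word to be header, so a command word that happened
to equal octal 125252 would be misread.

```c
#include <stddef.h>
#include <stdint.h>

#define NONSENSE 0125252  /* header word, as given in the text */

struct request {
    int16_t command, code, length;
};

/* Skips the leading NONSENSE word(s) and decodes the three-word
 * request header from words[0..n). Returns the index of the first
 * body word, or -1 if the message is malformed or incomplete. */
long parse_request(const uint16_t *words, size_t n, struct request *req)
{
    size_t i = 0;

    if (n == 0 || words[0] != NONSENSE)
        return -1;              /* at least one header word is required */
    while (i < n && words[i] == NONSENSE)
        i++;
    if (n - i < 3)
        return -1;              /* command, code, length must all be present */
    req->command = (int16_t)words[i];
    req->code    = (int16_t)words[i + 1];
    req->length  = (int16_t)words[i + 2];
    return (long)(i + 3);
}

/* Number of words occupied by a body of "length" bytes. */
int body_words(int length)
{
    return (length + 1) / 2;    /* odd lengths are rounded up to a full word */
}
```

The rounding in "body_words" mirrors the behavior attributed to
"getstr" below, which rounds the byte count up to an even integer.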

The  demon  sits  in  an  infinite  loop  awaiting  messages.  When  a 
message  is  received,  the  appropriate  action  is  taken,  as 
described  in  further  subsections.  The  routine  "getstr"  is  used 
to  read  from  the  word-parallel  line;  it  rounds  the  number  of 
bytes  up  to  an  even  integer  and  watches  out  for  errors  due  to 
terminal  interrupts.  The  routine  "signal"  is  called  to  catch 
terminal  interrupts,  which  otherwise  plague  all  Unix  processes 
started  at  a  given  terminal. 

5.2 DALIAS command

The string "fname" is split into two pieces to become the two
arguments for the Unix call "link". The length of the first
substring is "code". The effect of "link" is to make its second
argument an alias for the first one. The value returned by "link"
is passed on.
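
The split performed for DALIAS can be sketched as follows. The
routine name "split_alias" and its error convention are
hypothetical; only the rule that "code" gives the length of the
first substring comes from the text.

```c
#include <string.h>

#define MAXNAME 40  /* size of "fname", as given in Section 5.1 */

/* Splits the "length" bytes of "fname" into two null-terminated names,
 * the first being "code" bytes long. "first" and "second" must each
 * have room for MAXNAME+1 bytes. Returns 0 on success, -1 if the
 * lengths are inconsistent. */
int split_alias(const char *fname, int length, int code,
                char *first, char *second)
{
    if (length < 0 || length > MAXNAME || code < 0 || code > length)
        return -1;
    memcpy(first, fname, (size_t)code);
    first[code] = '\0';
    memcpy(second, fname + code, (size_t)(length - code));
    second[length - code] = '\0';
    return 0;
}
```

The two resulting strings would then be handed to "link" as its
first and second arguments.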

5.3 DCLOSE command

File descriptor "code" is closed (Unix call "close"). No
message is returned.

5.4  DCREAT  command 

The  file  "fname"  is  created  (Unix  call  "creat")  with  mode 
"code".  The  value  returned  by  "creat"  is  passed  on. 

5.5 DOPEN command

The  file  "fname"  is  opened  (Unix  call  "open")  with  mode 
"code".  The  value  returned  by  "open"  is  passed  on. 

5.6 DREAD command

For  this  command,  "length"  tells  the  number  of  bytes  to  read 
from  file  descriptor  "code",  using  the  Unix  call  "read".  This 
length is truncated to "BUFLEN" (512). The first word of the
return message is "code". The second word is the value returned by
"read", which tells the number of bytes actually read. The bytes
read are written next; if "read" returns -1 (error), nothing else
is written. A garbage byte will exist at the end if the number
of  bytes  actually  read  was  odd. 

5.7  DREADLINE  command 

This  command  is  identical  to  "DREAD",  except  that  the  Unix 
call "read" is used for one byte at a time. If a "newline"
character is encountered, it is considered part of the returned text,
and  reading  stops. 
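
The byte-at-a-time reading used by DREADLINE can be sketched with
the POSIX "read" call. The routine name "dreadline" is
hypothetical, and the handling of an error in the middle of a line
(return -1 and discard what was read) is an assumption; the text
only says that the command otherwise behaves like DREAD.

```c
#include <unistd.h>

#define BUFLEN 512  /* truncation limit, as given for DREAD */

/* Reads up to "length" bytes from "fd" into "buf", one byte per
 * read() call, stopping after a newline (which is kept). Returns the
 * number of bytes stored, 0 at end of file, or -1 on a read error. */
int dreadline(int fd, char *buf, int length)
{
    int n = 0;

    if (length > BUFLEN)
        length = BUFLEN;           /* same truncation as DREAD */
    while (n < length) {
        char c;
        ssize_t r = read(fd, &c, 1);

        if (r < 0)
            return -1;             /* error: nothing else is returned */
        if (r == 0)
            break;                 /* end of file */
        buf[n++] = c;
        if (c == '\n')
            break;                 /* newline is part of the text */
    }
    return n;
}
```

Reading one byte per call is slow, but it guarantees that nothing
beyond the newline is consumed from the file.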

5.8 DSEEK command

A  Unix  call  "seek"  is  performed  on  file  descriptor  "code". 
The  offset  for  the  "seek"  call  is  "length";  the  mode  for  the 
"seek"  call  is  the  next  word  read  from  the  word-parallel  line. 
The  value  returned  by  "seek"  is  passed  on. 

5.9 DSTAT command

A  Unix  call  "stat"  is  performed  on  file  "fname".  The  first 
word  of  the  return  message  is  -1  for  failure,  36  for  success.  In 
either case, 36 additional bytes are written; if the "stat"
succeeded, these bytes are the desired information.

5.10  DTIME  command 

A Unix call "time" is performed to return a double word. The
first  word  of  the  return  message  is  0;  the  next  two  words  are  the 
result  of  the  "time"  call.  This  command  is  only  used  by  the 
resource  manager  during  Roscoe  initialization. 

5.11 DUNLINK command

A  Unix  call  "unlink"  is  performed  on  the  file  "fname".  The 
value  returned  by  "unlink"  is  passed  on. 

5.12  DWRITE  command 

The  rest  of  the  incoming  message  is  read  into  "writebuf";  the 
length  of  this  text  is  "length",  truncated  to  BUFLEN  (512)  bytes. 
This  text  is  then  written  to  file  descriptor  "code",  using  the 
Unix call "write". The value returned is "code" if "write"
returned success; otherwise, the value returned is "code" times
minus  one. 


6.  LIBRARY  ROUTINES 


All the files in this section are in the directory
"/usr/network/roscoe/library". The object code is archived in
"libr.a".

6.1 File manager routines

These  routines  communicate  with  the  file  manager  (Section  4). 

The routine "opcreat" is used by "open", "create", "alias",
"unlink", and "stat" (the sources reside in "opcr.u", "open.u",
"create.u", "alias.u", "unlnk.u", and "stat.u", respectively) to
send a command and file name to the file manager over the given
file manager link. Another argument, "mode", has various
meanings for "open", "create", and "alias" calls. In the case of
"stat", an additional argument represents a REPLY link that is
passed to the file manager. The routine "stat" receives a
response over this link and copies the information into the
designated buffer. In the other four cases, "opcreat" waits for
a response from the file manager and gives a return value
accordingly (this value is a link number after a successful "open" or
"create"). The routine "alias" concatenates its two file name
arguments before calling "opcreat"; "mode" is then the length of
the first name, to eventually be decoded by the demon.

The routine "close" (in "close.u") is synonymous with "destroy".

The routine "someread" (in "somerd.u") is used by both "read"
and "readline" (in "read.u" and "rdln.u", respectively). The
appropriate command is sent over the given link (to either the file
manager or terminal driver). All reads are split up into
individual requests for MSLEN bytes at a time. The responses for the
portions of the read are copied into the given buffer.
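
The splitting of a read into MSLEN-byte requests can be sketched as follows. This is an illustration only: "someread_sketch", "fetch_fn", and "mem_fetch" are hypothetical names, the callback stands in for one request/response exchange over a link, and MSLEN is set to 40 as an assumption (the figure quoted in the Arachne User Guide). The sketch also assumes that a short response marks the end of the data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MSLEN 40   /* assumption: taken from the Arachne User Guide */

/* Hypothetical stand-in for one request/response exchange over a link. */
typedef size_t (*fetch_fn)(void *ctx, char *dst, size_t want);

/* Read up to nbytes into buf, asking for at most MSLEN bytes per request. */
size_t someread_sketch(void *ctx, fetch_fn fetch, char *buf, size_t nbytes)
{
    size_t done = 0;
    assert(fetch != NULL && buf != NULL);
    while (done < nbytes) {
        size_t want = nbytes - done;
        if (want > MSLEN)
            want = MSLEN;
        size_t got = fetch(ctx, buf + done, want);
        done += got;
        if (got < want)
            break;            /* short response: assume end of data */
    }
    return done;
}

/* Demo data source that serves bytes from memory and counts requests. */
struct mem_source { const char *data; size_t len, pos; int requests; };

size_t mem_fetch(void *ctx, char *dst, size_t want)
{
    struct mem_source *src = ctx;
    size_t left = src->len - src->pos;
    size_t n = want < left ? want : left;
    memcpy(dst, src->data + src->pos, n);
    src->pos += n;
    src->requests++;
    return n;
}
```

With these definitions, a 100-byte read is served by three requests of 40, 40, and 20 bytes.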

The  routine  "seek"  (in  "seek.u")  sends  an  appropriate  message 
to  the  file  manager  over  the  given  link,  and  waits  for  a  reply. 
Success  is  reported  by  a  zero  in  the  first  word  of  the  reply  mes¬ 
sage. 

The routine "write" (in "write.u") sends the appropriate
header message over the given link (to the file manager or
terminal driver) and then sends the data in subsequent messages MSLEN
bytes at a time. No reply is awaited, and no value is returned.

The routine "print" (in "print.u") is similar to the Unix printf
routine. It edits the output string and calls "write". A linear
buffer, "prbuf", of size PRINTBUFSIZE (100), is used. The format
string is scanned and the routines "printint", "printlong",
"printoct", and "printstr" are called to handle the conversions
for "%d", "%w", "%o", and "%s" format items, respectively. Both
"printint" and "printlong" check for the sign, take absolute
value, and use division by 10 (recursively), although "printlong"
uses long arithmetic. The routine "printoct" always produces six
characters; it first checks the sign bit, and then inspects three
bits at a time with an appropriate shift. In all cases "printch"
is used to put characters into "prbuf" and to call "write" when
the buffer is full. The buffer is also flushed at the end.
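
The octal and decimal conversions described above can be sketched as follows. This is a modern C illustration under stated assumptions, not the library source: the names ending in "_sketch", the "outbuf" structure, and the helper "printdigits" are hypothetical, a word is taken to be 16 bits, and the sketch collects characters in a small buffer instead of flushing through "write".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "prbuf"; no flushing is modelled here. */
struct outbuf { char text[32]; size_t len; };

/* Append one character and keep the buffer NUL-terminated. */
static void printch_sketch(struct outbuf *b, char c)
{
    assert(b->len + 1 < sizeof b->text);
    b->text[b->len++] = c;
    b->text[b->len] = '\0';
}

/* Always six characters: the sign bit alone, then five 3-bit groups. */
void printoct_sketch(struct outbuf *b, unsigned int word)
{
    word &= 0xFFFFu;                       /* assumption: 16-bit words */
    printch_sketch(b, (char)('0' + (word >> 15)));
    for (int shift = 12; shift >= 0; shift -= 3)
        printch_sketch(b, (char)('0' + ((word >> shift) & 7u)));
}

/* Emit the leading digits first by recursing on magnitude / 10. */
static void printdigits(struct outbuf *b, unsigned int magnitude)
{
    if (magnitude >= 10)
        printdigits(b, magnitude / 10);
    printch_sketch(b, (char)('0' + magnitude % 10));
}

/* Check the sign, take the absolute value, then divide by 10 recursively. */
void printint_sketch(struct outbuf *b, int value)
{
    unsigned int magnitude = (unsigned int)value;
    if (value < 0) {
        printch_sketch(b, '-');
        magnitude = 0u - magnitude;        /* absolute value without overflow */
    }
    printdigits(b, magnitude);
}
```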

6.2  Resource  manager  request  routines

The routines "fsline" and "parline" (in "fsline.u" and
"parlin.u") ask the resource manager (Section 1) for the
appropriate link and return the enclosure. The routines "inline"
and "outline" (in "inline.u" and "outlin.u") first ask the
resource manager for a terminal link, then ask the terminal
driver for the appropriate line, and finally return the
enclosure.

The  routine  "fork"  (in  "fork.u")  sends  a  start  message  to  the 
resource  manager,  conveying  the  file  name,  argument,  and  mode, 
but  always  specifying  ANSWER.  A  link  is  given  to  the  child  of 
type  REQUEST,  GIVEALL,  and  TELLDEST.  The  first  word  of  the  reply 
message  is  returned;  in  particular,  this  word  is  the  process 
identifier  for  a  BACKGROUND  child.  If  the  start  failed,  "fork" 
waits  for  the  given  link  to  be  destroyed.

The routine "killoff" (in "killof.u") conveys the kill
request to the resource manager, gets a reply, and returns the
first word of the reply message.

6.3  Roscoe  service  calls

The  Roscoe  service  call  interface  is  the  assembler  file 
"lib.s".  For  each  call,  an  appropriate  magic  number  is  placed  in 
register  1  and  a  jump  is  made  to  the  Roscoe  entry  point  "sys" 
(octal  location  1002).  Arguments  are  left  on  the  stack;  the  ker¬ 
nel  takes  it  from  there. 

6.4  Miscellaneous

The  routine  "atoi"  (in  "atoi.u")  converts  a  string  into  an 
integer. 

The  file  "call.u"  contains  "call"  and  "recall".  The  routine 
"call"  sends  a  message  as  indicated,  encloses  a  REPLY  link,  puts 
the  REPLY  link's  code  into  the  global  variable  "unique",  and  in¬ 
vokes  "recall".  The  latter  receives  a  message  with  a  five  second 
delay,  and  checks  that  the  incoming  message  has  the  proper  code 
("unique")  and  note  ("DATA"). 

The  assembler  file  "reset.s"  contains  "setexit"  and  "reset".
The  routine  "setexit"  saves  register  5  and  the  old  program 
counter  in  global  locations  "sr5"  and  "spc".  The  routine 
"reset",  by  restoring  these,  effects  a  return  to  the  environment 
which  last  called  "setexit". 
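
For readers more familiar with standard C, the effect is comparable to the "setjmp" and "longjmp" facilities. The sketch below is only an analogy, not the Roscoe implementation: "saved_env" plays the role of the saved register and program counter, and the names "descend" and "run_until_reset" are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf saved_env;      /* plays the role of "sr5" and "spc" */
static int deepest_level;      /* records how far the descent went */

/* Recurse a few levels, then jump straight back to the saved environment. */
static void descend(int level)
{
    assert(level <= 3);
    deepest_level = level;
    if (level == 3)
        longjmp(saved_env, 1);          /* comparable to "reset" */
    descend(level + 1);
}

/* Save the environment, descend, and report where the jump came from. */
int run_until_reset(void)
{
    if (setjmp(saved_env) != 0)         /* comparable to "setexit" */
        return deepest_level;           /* reached via longjmp */
    descend(0);
    return -1;                          /* not reached in this sketch */
}
```

Calling "run_until_reset" returns 3: control comes back to the saving routine directly from the third nested call, without unwinding the intermediate calls one by one.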

The file "user.h" contains various macros freely referred to
in this documentation. For the user's convenience, it also
defines TRUE (all 1's) and FALSE (all 0's) and the following
structures:

struct {char lowbyte, highbyte;};

struct {int word1, word2;};

The file "time.u" contains the routines "datetol" and
"ltodate", which convert character strings into long integers
(representing seconds since the beginning of 1973) and vice
versa, respectively. The array "calendar" contains the number of
days preceding each month in a leap year, with pseudodata "366"
as a thirteenth entry. Character string arrays store the days of
the week and months of the year. The macro FOURYEARS gives the
number of days in a four-year period. The first step of
"datetol" is to convert the given string, with format
"yymmddhhmmss", into an array of six two-digit integers. Arrays
"lbound" and "ubound" are used to check that these integers are
reasonable. Sizes of months are also checked by subtracting the
appropriate consecutive entries in "calendar". The number of
days is calculated by computing the number of four-year intervals
beginning with 1973 (and multiplying by FOURYEARS), then adding
on the proper (0-3) number of (non-leap) years (times 365), then
adding on the month offset as found in "calendar", and finally
adding in the day of the month. For non-leap years, February
29th is caught as a mistake, and any day occurring later in the
year is decreased by one. Finally, hours, minutes, and seconds
are added on.

The reverse process is carried out by "ltodate". Seconds,
minutes, and hours are first removed. The day of the week is
computed from the number of days modulo 7. Division by FOURYEARS
determines the four-year period; the remainder determines the
exact year and day within the year, with a remainder of (4*365)
representing December 31st of a leap year. In a non-leap year,
conversion (to the proper format for "calendar") is performed by
increasing by one any day of the year larger than 58. (February
28th remains 58; March 1st is bumped to 60; etc.) The month is
calculated by dividing the number of days by 30; the answer may
be too large by one and is corrected by inspecting "calendar".
As the result is computed, it is edited into a character string.
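
The two conversions can be sketched in modern C as follows. This is a reconstruction from the description, not the contents of "time.u": the "_sketch" names are hypothetical, the bounds in "lbound" and "ubound" are assumptions, the day-of-week output is omitted, and the day of the month is counted from zero so that the first second of 1973 maps to 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FOURYEARS (3 * 365 + 366)   /* days in a four-year period */
#define EPOCH_YY  73                /* two-digit year of the epoch, 1973 */

/* Days preceding each month in a leap year; 366 is the pseudo entry. */
static const int calendar[13] =
    {0, 31, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335, 366};

/* "yymmddhhmmss" to seconds since the start of 1973, or -1 if invalid. */
long datetol_sketch(const char *s)
{
    static const int lbound[6] = {EPOCH_YY, 1, 1, 0, 0, 0};   /* assumed */
    static const int ubound[6] = {99, 12, 31, 23, 59, 59};    /* assumed */
    int f[6];

    if (strlen(s) != 12)
        return -1;
    for (int i = 0; i < 6; i++) {
        char hi = s[2 * i], lo = s[2 * i + 1];
        if (hi < '0' || hi > '9' || lo < '0' || lo > '9')
            return -1;
        f[i] = (hi - '0') * 10 + (lo - '0');
        if (f[i] < lbound[i] || f[i] > ubound[i])
            return -1;
    }
    int years = f[0] - EPOCH_YY;
    int leap = (years % 4 == 3);         /* 1976, 1980, ... */
    int month = f[1] - 1;
    if (f[2] > calendar[month + 1] - calendar[month])
        return -1;                       /* day exceeds size of month */
    if (!leap && month == 1 && f[2] == 29)
        return -1;                       /* February 29th, non-leap year */

    long days = (long)(years / 4) * FOURYEARS + (long)(years % 4) * 365
              + calendar[month] + (f[2] - 1);
    if (!leap && month > 1)
        days -= 1;                       /* later days shift back by one */
    return ((days * 24 + f[3]) * 60 + f[4]) * 60 + f[5];
}

/* Seconds since the start of 1973 back to "yymmddhhmmss". */
void ltodate_sketch(long secs, char out[13])
{
    assert(secs >= 0);
    int ss = (int)(secs % 60); secs /= 60;
    int mi = (int)(secs % 60); secs /= 60;
    int hh = (int)(secs % 24);
    long days = secs / 24;
    long rem = days % FOURYEARS;
    int years = (int)(days / FOURYEARS) * 4;
    int doy;

    if (rem == 4 * 365) {                /* December 31st of a leap year */
        years += 3;
        doy = 365;
    } else {
        years += (int)(rem / 365);
        doy = (int)(rem % 365);
    }
    if (years % 4 != 3 && doy > 58)
        doy++;                           /* convert to leap-year numbering */
    int month = doy / 30;                /* may be too large by one */
    if (doy < calendar[month])
        month--;
    snprintf(out, 13, "%02d%02d%02d%02d%02d%02d",
             (EPOCH_YY + years) % 100, month + 1,
             doy - calendar[month] + 1, hh, mi, ss);
}
```

The thirteenth "calendar" entry serves both checks: it gives December its size by subtraction, and it lets the month estimate of 12 (for days 360 and later) be corrected like any other overestimate.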

SIGNIFICANCE  AND  EXPLANATION


Arachne  is  an  experimental  operating  system  for  controlling  a  network 
of  microcomputers.  It  is  currently  implemented  at  the  University  of 
Wisconsin  on  a  network  of  five  minicomputers.  Some  of  its  essential 
features  are:  All  processors  are  identical,  although  they  may  differ  in 
peripheral  units.  No  memory  is  shared  between  processors,  and  all  com¬ 
munication  involves  messages  passed  between  processes.  The  way  in  which 
the  processors  are  interconnected  is  not  important.  The  network  appears 
to  the  user  to  be  a  single  machine.

This  report  describes  Arachne  from  the  viewpoint  of  a  user  or  a 
writer  of  user-level  programs. 


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  authors  of  this  report.

ARACHNE  USER  GUIDE
Version  1.2 

Raphael  Finkel,  Marvin  Solomon  and  Ron  Tischler


1.  INTRODUCTION 


Arachne  is  an  experimental  operating  system  for  controlling 
a  network  of  microcomputers.  It  is  currently  implemented  on  a 
network  of  five  Digital  Equipment  Corporation  LSI-11  computers 
connected  by  medium-speed  lines.*  The  essential  features  of 
Arachne  are:

1.  All  processors  are  identical.  Similarly,  all  processors 
run  the  same  operating  system  kernel.  However,  they  may  differ 
in  the  peripheral  units  connected  to  them. 

2. No memory is shared between processors. All communication
involves messages explicitly passed between physically
connected processors.

3. No assumptions are made about the topology of
interconnection except that the network is connected (that is, there is a
path between each pair of processors). The connecting hardware
is assumed to be sufficiently fast that concurrent processes can
cooperate in performing tasks.


Appeared as Computer Sciences Technical Report 379. We have been
forced to change the name of the Roscoe distributed operating
system, since Roscoe is a registered trademark of Applied Data
Research, Incorporated. The new name we have chosen is Arachne;
the operating system and research continue unchanged.

* This equipment was purchased with funds from National Science
Foundation Research Grant MCS77-08963.

Sponsored by the United States Army under Contract Nos.
DAAG29-75-C-0024 and DAAG29-80-C-0041.


4.  The  network  appears  to  the  user  to  be  a  single  powerful 
machine.  A  process  runs  on  one  machine,  but  communicating 
processes  have  no  need  to  know  if  they  are  on  the  same  processor 
and  no  way  of  finding  out.  (Migration  of  processes  to  improve 
performance  is  transparent  to  the  processes  involved.) 

5. The network is constructed entirely from hardware
components commercially available at the time of construction
(January, 1978).

6. The software is all functional. Although Arachne has
undergone much revision, it has been working for over a year.

1.1  Purpose  of  this  Document 

This  document  describes  Arachne  from  the  point  of  view  of  a 
user  or  user-programmer.  It  is  both  a  tutorial  and  a  reference 
guide  to  the  facilities  provided  to  the  user.  All  information 
necessary  to  the  programmer  of  applications  programs  should  be 
found  here. 

The concepts and goals of Arachne are discussed further in
[Solomon 78, 79]. That document also lists some research
problems that the Arachne project intends to investigate. The
operating system kernel that provides the facilities listed
below is described in considerable detail in [Finkel 78, 80b].
Similar detailed documentation about utility processes (such as
the File System Process, the Teletype Driver, the Command
Interpreter, and the Resource Manager) is contained in
[Finkel 79a, 79b].

Arachne has been developed with extensive use of the UNIX
operating system [Ritchie 74]. All code (with the exception of a
small amount of assembly language) is written in the C
programming language [Kernighan 73]. The reader of this document is
assumed to be familiar with both UNIX and C.

A new programming language, called Elmer, is being designed
for applications programs under Arachne; it will be described in
a future report. Arachne programs may be written in either Elmer
or C. Currently, the library is available only in C.

1.2  Caveat

Arachne is in a state of rapid flux. Therefore, many of the
details described in this Guide are likely to change. The reader
who intends to write Arachne programs should check with one of
the authors of this report for updates.

1.3  Format  of  this  Guide

Section 2 provides an overview of the concepts and
facilities of Arachne. Section 3 describes the facilities by name,
arranged according to general subject areas. Section 4 is a
programmer's reference manual. Each function is listed
alphabetically; its syntax and purpose are described, and it is
classified as a service call (an invocation of an operating system
kernel routine) or a library routine (a procedure linked into the
user program). Section 5 describes the command line interpreter
and lists the commands that may be entered from the terminal.
Section 6 describes the conventions governing terminal
input/output. Section 7 presents protocols for communicating
with the various utility processes.


1.4  Revisions

The  following  changes  have  been  made  to  Arachne  since  ver¬ 
sion  1.0  of  this  document: 

There  is  a  new  service  call,  "linkok",  to  determine  if  a
link  number  is  currently  valid.  The  library  routine  "call"  uses 
this  service  call  to  avoid  sending  a  message  across  a  bad  link. 

Messages  now  include  length  information.  The  library  rou¬ 
tines  "call"  and  "recall"  have  been  modified  to  reflect  this 
change.  The  file  and  terminal  protocols  have  also  been  simpli¬ 
fied. 

The  following  changes  have  been  made  since  version  1.1  of
this  document:

A  new  utility  process,  the  pipe,  is  now  available.  Pipes 
allow  the  output  of  one  user  process  to  be  attached  to  the  input 
of  another. 

The structure "usmesg" has been abolished, and "urmesg" no
longer contains the body of the message. Instead, both "send"
and "receive" have a new argument that specifies the message
body.

A  new  link  restriction,  MAYERROR,  is  orthogonal  to  all  other 
restrictions.  The  last  argument  to  send  may  have  the  ERROR  bit 
on,  in  which  case  the  message  is  considered  an  error  report  if 
the  link  across  which  it  is  sent  has  MAYERROR  specified.  Receipt
of  an  error  report  raises  an  exception. 

The "die" service call now takes a character-string
argument. This argument becomes the body of any DESTROYED message
that is generated due to the termination of the calling process.

When  a  process  dies,  error  reports  are  sent  along  any  links 
that  it  holds  with  restriction  MAYERROR  but  not  TELLDEST. 

Many  errors  caused  by  service  calls  raise  exceptions.  An 
exception  can  only  occur  during  a  service  call.  If  it  is  not 
caught,  the  guilty  process  terminates.  Exceptions  may  be  caught 
with  the  "errhandler"  service  call. 

A  new  facility  for  asynchronous  message  receipt,  called 
"catch",  allows  a  procedure  to  be  specified  that  will  be  invoked 
as  soon  as  a  message  arrives  on  the  specified  channels. 

The  "display"  kernel  call  returns  timing  information  about 
the  owner  of  any  link. 

We  have  been  forced  to  change  the  name  of  the  Roscoe  distri¬ 
buted  operating  system,  since  Roscoe  is  a  registered  trademark  of 
Applied  Data  Research,  Incorporated.  The  new  name  we  have  chosen 
is  Arachne;  the  operating  system  and  research  continue  unchanged. 

2.  ARACHNE  CONCEPTS  AND  FACILITIES

The fundamental entities in Arachne are: files, programs,
core images, processes, links, and messages. The first four of
these are roughly equivalent to similar concepts in other
operating systems; the concepts of links and messages are idiomatic to
Arachne. A file is a sequence of characters on disk. Each file
has directory information giving the time of last modification
and restrictions on reading, writing, and execution. The
contents of a file may contain header information that further
identifies it as an executable program. Version 1 of Arachne uses
the UNIX file system; therefore, the reader familiar with UNIX
should have no problem understanding Arachne files.

Program files contain text (machine instructions),
initialized data, and a specification of the size of the uninitialized
global data space (bss) required by the program. Program files
also contain relocation information and an optional symbol table.

2.1  Processes

A process is a locus of activity executing a program. Each
process is associated with a local data area called its stack. A
program that never modifies its global initialized or bss data
but only its local (stack) data is re-entrant, and may be shared
by several processes without conflict. A main-storage area
containing the text of a program, its initialized data, and a bss
data area, but not including a stack, is called a core image. C
core images may not share text areas unless they are reentrant;
the text and data areas of Elmer programs are loaded separately,
so Elmer programs may share text even if they are not reentrant.
The initiation of a process entails locating or creating (by
loading) a core image, allocating a stack, and initializing the
necessary tables to record its state of execution. Similarly,
when a process dies, its tables are finalized and its stack space
is reclaimed. If no other processes are executing in its core
image, then the space occupied by the core image is available for
re-use.

2.2  Links


All communication is performed by message passing across
links. A link combines the concepts of a communications path and
a "capability." A link represents a logical one-way connection
between two processes, and should not be confused with a line,
which is a physical connection between two processors. The link
concept is central to Arachne. It is inspired and heavily
influenced by the concept of the same name in the Demos operating
system for the Cray-1 computer [Baskett 77]. Each link connects two
processes: the holder, which may send messages over the link, and
the owner, which receives them. The holder may duplicate the
link or give it to another process, subject to restrictions
associated with the link itself. (See "Link restrictions" below.)
The owner of a link, on the other hand, never changes.

Links are created by their owners. When a link is created,
the creator specifies a code and a channel. The kernel
automatically tags each incoming message with the code and channel of the
link over which it was sent. Channels are used by a process to
partition the links it owns into subsets: When a process wants to
receive a message, it specifies a set of channels. Only a
message coming over a link corresponding to one of the specified
channels is eligible for reception. A link is named by its
holder by a small positive integer called a link number, which is an
index into a table of currently-held links maintained by the
kernel for the holder. All information about a link is stored in
this table. (No information about a link is stored in the tables
of the owner.)
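
The way channels select messages for reception can be sketched as follows. This is an illustration only: the representation of a channel set as a bit mask is an assumption, and the names "incoming" and "first_eligible" are hypothetical rather than part of the kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* The tags the kernel attaches to a message: the code and channel
 * chosen by the owner when the link was created. Each channel is
 * assumed to be a single bit so that a set of channels is a mask. */
struct incoming { int code; unsigned channel; };

/* Return the index of the first queued message whose channel is in
 * the wanted set, or -1 if no queued message is eligible. */
int first_eligible(const struct incoming *queue, size_t count, unsigned wanted)
{
    assert(queue != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (queue[i].channel & wanted)
            return (int)i;
    return -1;
}
```

A receiver that asks only for one channel skips earlier messages tagged with other channels; the code of the selected message then tells it which of its links the message came over.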


2.3  Messages 


A  message  may  be  sent  by  the  holder  to  the  owner  of  a  link. 

A  message  may  contain,  in  addition  to  MSLEN  (currently  40) 
characters  of  text,  an  enclosed  link.  The  sender  of  the  message
specifies  the  link  number  of  a  link  it  currently  holds.  The  ker¬ 
nel  adds  an  entry  to  the  link  table  of  the  destination  process 
and  gives  its  link  number  to  the  recipient  of  the  message.  In 
this  way,  the  recipient  becomes  the  holder  of  the  enclosed  link. 
If  the  original  link  is  not  destroyed,  the  sender  and  the  reci¬ 
pient  hold  identical  copies  of  the  link. 


2.4  Link  restrictions 


Links  may  be  created  with  various  restrictions.  These  can 
be  characterized  as  modes,  permissions,  and  notifications.  The 
orthogonal  modes  are  REQUEST  and  REPLY.  A  reply  link  is  dis¬ 
tinguished  by  the  fact  that  it  can  only  be  used  once;  it  is  des¬ 
troyed  when  a  message  is  sent  over  it.  A  reply  link  may  not  be 
the  enclosed  link  in  a  message  sent  over  another  reply  link. 
Similarly,  a  request  link  cannot  be  sent  over  a  request  link. 
These  restrictions  enforce  a  communication  protocol  in  which  all 
communications  between  two  processes  connected  by  a  REQUEST  link 
are  initiated  by  the  holder  of  that  link. 
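
The two mode rules can be sketched as a small check. This is an illustration under stated assumptions, not kernel code: the type and function names are hypothetical, and only the REQUEST/REPLY rules described above are modelled (permissions and notifications are ignored).

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MODE_REQUEST, MODE_REPLY } link_mode;

/* Hypothetical record of a held link: its mode and whether it still exists. */
struct link_sketch { link_mode mode; int alive; };

/* A reply link may not travel over a reply link, nor a request link
 * over a request link, so the two modes must differ. */
int may_enclose(link_mode carrier, link_mode enclosed)
{
    return carrier != enclosed;
}

/* Send over "carrier", optionally enclosing another link.
 * Returns 1 if the send is allowed, 0 if it is refused. */
int send_over(struct link_sketch *carrier, const struct link_sketch *enclosed)
{
    assert(carrier != NULL);
    if (!carrier->alive)
        return 0;
    if (enclosed != NULL && !may_enclose(carrier->mode, enclosed->mode))
        return 0;
    if (carrier->mode == MODE_REPLY)
        carrier->alive = 0;        /* a reply link can be used only once */
    return 1;
}
```

In this model a holder of a request link can enclose a reply link in each message it sends, and the owner can answer exactly once over that reply link, which is the request-then-reply pattern the restrictions are meant to enforce.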


Two permissions are GIVEALL and DUPALL, controlling
distribution of the affected link to other parties. A third permission
is MAYERROR, which allows the holder to send an error message,
whose receipt will raise an exception.

The notifications are TELLGIVE, TELLDUP, and TELLDEST. When
these restrictions are in force, unforgeable messages are sent to
the owner of the link when it is given away, duplicated, or
destroyed. (The last of these messages contains a body provided by
the holder if it dies holding the link.)

2.5  Service  calls

The  Arachne  kernel  is  a  module  that  resides  identically  on 
all  the  machines  of  the  network  and  provides  various  services  for 
user  programs.  The  services  are  requested  by  means  of  service 
calls,  which  appear  to  the  caller  to  be  procedure  invocations.

The  chief  function  of  the  kernel  is  to  support  link  mainte¬ 
nance  and  message  passing  by  providing  service  calls  to  create 
and  destroy  links  and  send,  receive  and  catch  messages.  Addi¬ 
tional  service  calls  create  and  destroy  processes,  read  and  set 
"wall-clock"  and  high-resolution  interval  timers,  specify  a 
handler  to  catch  exceptions,  and  establish  interrupt  handlers  for 
processes  that  control  peripheral  devices. 

2.6  Utility  processes

Arachne has been designed so that as many as possible of the
traditional operating system functions are provided not by the
kernel, but by ordinary processes. These utility processes may
invoke service calls not intended to be used by the casual user,
but otherwise they behave exactly like user processes. The
terminal driver is an example. One terminal driver resides on each
processor that has a terminal. All terminal input/output by
other processes is requested by messages to this process. It
understands and responds to most commands accepted by a file (see
below), as well as a few extra ones, such as "set modes" (e.g.,
echo/no echo, hard copy/soft copy).

A  file  manager  process  has  access  to  the  Arachne  file  sys¬ 
tem,  currently  implemented  on  the  supporting  PDP-11/40.  A  re¬ 
quest  to  open  a  file  sent  to  any  file  manager  process  causes  a 
link  to  be  created  representing  the  open  file.  To  the  user  of  a 
file,  the  open  file  behaves  like  a  process  that  understands  and 
responds  to  messages  requesting  read  and  write  operations.  The 
file  is  closed  by  destroying  the  link.  A  version  of  the  file
manager  that  uses  a  floppy  disk  instead  of  the  PDP-11/40  file 
system  is  also  available;  it  follows  the  same  protocols  as  the 
other  file  manager. 

The most complex utility process is the resource manager
(RM). Resource managers reside on all processors and are
connected by a network of links. A process can request an RM to
create  a  new  process.  The  RM  may  create  the  process  on  its  own 
machine  or  relay  the  request  to  another  RM  based  on  such  con¬ 
siderations  as  location  of  the  process  that  requested  the  crea¬ 
tion,  availability  of  free  memory,  proximity  of  resources  such  as 
devices  and  files,  and  the  possibility  that  the  required  program 
is  already  in  memory. 

The new process is started with a link to its RM, over which
it can request links to the process that requested its creation,
to a file manager process, to a terminal driver, or to other
resources. The RM can kill the process, or it can give a special
link to another process (usually a terminal driver) that may be
used to kill it.

2.7  Library  routines

Functions  provided  by  service  calls  are  rather  primitive, 
and  communication  with  utility  processes  can  involve  complicated 
protocols.  An  extensive  library  of  routines  has  been  provided  to 
simplify  writing  of  programs  that  use  service  calls  and  utility 
processes.  These  routines  serve  to  hide  the  communication  neces¬ 
sary  to  accomplish  various  tasks,  and  make  it  especially  easy  to 
introduce  software  not  originally  designed  for  the  Arachne  en¬ 
vironment.  These  routines  can  only  be  used  with  C  programs;  the 
Elmer  library  is  under  construction. 

3.  SUBJECT-AREA  GUIDE

This  section  lists  service  calls  and  library  routines  by 
subject  area. 

3.1  Links  and  Messages 

A new link is created by a process through the "link"
service call. Initially, the creator is both holder and owner of
the link. The creator specifies what channel and code to
associate with the link, so that future messages arriving along it can
be selectively received and identified. In addition, the creator
may place restrictions on the use of the link, controlling
whether or not it may be given to third parties, duplicated, or used
repeatedly, and requiring notifications to be sent along it in
the event of link duplication, transferral, or destruction.
Finally, links may specify that they can carry error messages.
Receipt of an error message terminates the recipient.

Messages  are  sent  with  the  "send"  service  call,  which  speci¬ 
fies  a  link  over  which  the  message  is  to  be  sent,  the  message 
text  and  an  optional  enclosed  link.  It  also  indicates  if  the 
message  is  an  error  message. 

Messages  are  accepted  by  "receive,"  which  specifies  a  set  of 
channels,  a  place  to  put  the  message,  and  a  maximum  time  the  re¬ 
cipient  is  willing  to  wait.  "Receive"  can  also  be  used  to  sleep 
a  specified  period  of  time  by  waiting  for  a  message  that  will 
never  arrive.  Asynchronous  message  receipt  is  accomplished  by 
"catch",  which  has  the  same  arguments  as  receive,  except  it  has 
no  wait  time,  and  it  specifies  a  procedure  to  call  when  an  ap¬ 
propriate  message  arrives.  This  catcher  procedure  is  very  limit¬ 
ed  in  the  kernel  calls  it  can  perform. 

A  simple  send-receive  protocol  is  embodied  in  the  library 
functions  "call"  and  "recall,"  which  are  simpler  to  use  than  send 
and  receive,  and  should  be  adequate  for  most  routine  communica¬ 
tion.  The  "call"  library  routine  sends  a  message  along  a  given 
link,  enclosing  a  reply  link.  It  then  waits  five  seconds  for  a 
response,  which  it  returns  to  the  caller.  If  no  answer  has  ar¬ 
rived  in  five  seconds,  it  returns  failure,  and  the  "recall"  rou¬ 
tine  can  be  invoked  to  continue  waiting  for  the  tardy  response. 

3.2  Processes

A  process  may  spawn  others  by  communicating  with  the 
resource  manager;  typical  cases  are  handled  by  the  library  rou¬ 
tine  "fork".  The  requestor  indicates  whether  the  child  should  be 
run  as  a  foreground,  background,  or  detached  job.  Foreground 
processes  are  attached  to  a  terminal  and  can  be  terminated  by
entry  of  a  control-C.  Background  processes  may  only  be  terminated
by  requesting  the  resource  manager  to  remove  them,  which  is  ac¬ 
complished  by  the  library  routine  "killoff".  Detached  processes 
cannot  be  terminated  except  at  their  own  request.  The  caller 
also  indicates  whether  the  child  process  may  share  its  core  image 
with  other  processes,  whether  an  old  and  inactive  core  image  may 
be  used,  or  whether  a  fresh  core  image  is  required. 

Every  user  process  is  started  holding  link  number  0,  whose 
destination  is  the  resource  manager  on  that  process's  machine. 
When  calling  "fork",  the  parent  may  indicate  a  link  that  it 
wishes  to  give  to  the  child;  the  child  obtains  this  link  with  the 
library  routine  "parline",  which  communicates  with  the  resource 
manager  along  link  0.  A  process  can  terminate  itself  by  calling 
"die";  it  can  yield  the  CPU  to  another  process  by  the  service 
call  "nice".  (Scheduling  is  not  pre-emptive.) 

Four low-level process-control service calls are provided
for the use of the resource manager; they are not intended for
the typical user. The service call "load" arranges for bringing
new core images into the processor on which the caller resides.
If there is no room, the call returns failure, and the resource
manager can try to find a neighboring resource manager that might
have better luck. Once a core image is loaded, processes can be
started in it with the service call "startup", which provides the
new process with an initial link 0 of the caller's choosing. The
"kill" service call removes a process, and "remove" reclaims its
core image. The separation of images and processes allows one
core image to be used simultaneously by several processes, and a
core image may be saved after the last process is gone to speed
up the next invocation of a process that would use it.


3.3  Timing


Arachne  has  two  notions  of  time.  One  is  the  wall  clock, 
which  keeps  track  of  seconds  in  real  time.  Messages  sent  between 
resource  managers  are  routinely  used  to  keep  the  various  machines 
synchronized.  There  is  also  an  interval  timer,  which  may  be  used 
to  monitor  elapsed  time  in  increments  of  ten-thousandths  of 
seconds.  No  process  may  change  the  interval  timer. 

The wall clock is referenced, changed, enciphered, and deciphered
by "date", "setdate", "datetol", and "ltodate", respectively.
The interval timer is referenced by "time". The percentage
of time used by any process may be discovered with "display".


3.4 Interrupts and Exceptions

User programs may handle their own interrupts. This feature
is currently used by the terminal driver. A process may establish
an interrupt-level routine with the "handler" service call.
This call names not only the interrupt handling routine and which
interrupt it is intended to service, but also a channel along
which to receive messages sent by that interrupt routine. The
interrupt-level routine should, of course, be thoroughly debugged
and fast. Interrupt-level routines may notify the process that
established them by the service call "awaken". This call causes
a special message to be sent to the master routine along the
channel it specified in its "handler" call. Since the master and
interrupt-level routines share code and data, all details of the
communication are embedded in shared variables; the awaken
message itself is empty.

If a process arranges for asynchronous receipt of messages
by using a "catch" service call (see "Links and Messages" above),
then arrival of such a caught message will not preempt any other
process. However, if the catching process is currently executing,
control will immediately switch to the catcher routine within the
process.

Exceptions are raised by many service call errors (usually
poorly formed service calls) and by receipt of error messages.
Usually, exceptions cause the termination of the offending process.
Exceptions may be caught by establishing a handler with
the "errhandler" service call. When an exception arises, the
handler will be invoked with arguments indicating the value returned
from the failed service call, the service call number, and
all the arguments to the service call. Return from the handler
acts like return from the service call.

3.5 Input/Output

To  use  files,  a  process  first  obtains  a  link  to  the  file 
manager  process  by  calling  the  library  routine  "fsline",  which 
communicates  with  the  local  resource  manager.  This  link  is  used 
in  subsequent  library  routine  calls:  "open"  and  "create"  make  new 
files  or  ready  old  ones  for  reading  or  writing,  and  return  links 
to be used for manipulations of those files. The library routines
"read", "write", and "seek" act much like the Unix file
primitives of the same name to provide random access into the
open file. A file is closed by the library routine "close",
which is identical to the service call "destroy", which destroys
a link. Finally, the library routine "stat" returns various
information about the open file. Each of these library routines
packages a request in a message that is sent across the file
access link to the file manager process.

To use the terminal, a process obtains input and output
links by calling the library routines "inline" and "outline",
respectively, which communicate with the local resource manager.
An input link can be used to discover or change terminal modes
(only the command interpreter uses this feature) and to perform
terminal input. An output link can be used for terminal output.
These links may also be "closed"; they are closed automatically
when a process dies. The terminal driver allows at most one
input link to be open at a time.

Reading  is  performed  by  the  library  routines  "read"  and 
"readline".  Writing  is  performed  by  "write"  and,  if  formatting 
is  desired,  by  "print".  Each  of  these  routines  works  equally 
well  in  dealing  with  a  file  instead  of  the  terminal.  The  service 
call  "printf"  is  identical  to  "print"  except  that  it  always  uses 
the terminal; it is a debugging tool not intended for the typical
user.

The  user  familiar  with  UNIX  is  cautioned  against  assuming 
that  any  particular  buffer  size  is  particularly  efficient  for 
reads  or  writes,  because  Arachne  splits  up  I/O  into  packets  of 
size  MSLEN  bytes  anyway. 
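
As a rough illustration of why no buffer size is privileged, the number of packets needed for a request depends only on the request size and MSLEN. The sketch below is an assumption-laden example: the function name "packets_needed" is hypothetical, and the value of MSLEN is passed as a parameter because the text does not state it here.

```c
#include <assert.h>

/* Number of packets needed to carry "size" bytes when every packet
 * holds at most "mslen" bytes. "mslen" stands in for MSLEN, whose
 * actual value is not assumed here. */
long packets_needed(long size, long mslen)
{
    assert(mslen > 0);
    if (size <= 0)
        return 0;
    return (size + mslen - 1) / mslen;   /* ceiling division */
}
```

For example, with a purely illustrative packet size of 40 bytes, a 512-byte read would be carried in 13 packets, so a 512-byte buffer has no special alignment advantage.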

3.6 Miscellaneous Routines

The following routines from the C library also exist in the
Arachne library: atoi, long arithmetic routines, reset, setexit,
strcopy, streq, strge, strgt, strle, strlen, strlt, strne,
and subsizr.

An  additional  routine  supplied  by  Arachne  is  "copy". 

3.7 Preparing User Programs

User programs for Arachne are written in the C programming
language. They are compiled under UNIX on the PDP-11/40 and
should include the files "user.h" and "util.h" in directory
/usr/network/roscoe/user. Source programs should have filenames
ending with ".u". To prepare a file named "foo.u", execute
"makeuser foo", which creates an executable file for Arachne
named "foo". The executable files are always stored in
/usr/network/roscoe/user.

4.  ROSCOE  PROGRAMMER'S  MANUAL 

The  following  is  an  alphabetized  list  of  all  the  Arachne 
service  calls  and  library  routines.  For  each  service  call  error 
result, the notation "(*)" indicates that the error causes an exception
to be raised.

4.1 Alias (Library Routine)

int alias(fslink,fname1,fname2) char *fname1, *fname2;

The new name "fname2" is associated with file "fname1". The
argument "fslink" is the caller's link to the file manager. The
old name is still valid. Possible errors: The combined length
of "fname1" and "fname2" must not exceed MSLEN-6. The name
"fname2" must not already be in use. File "fname1" must exist.
All errors return -1.

4.2 Awaken (Service Call)
awaken()

Only an interrupt-level routine may use this call. It sends
a message to the process that performed the corresponding
"handler" call along the channel specified by that "handler"
call.

Returned values: Success returns a value of 0. -2 is returned
if the message cannot be sent because no buffers are
available; an "awaken" may succeed later.

4.3  Call  (Library  Routine) 

int call(ulink,outmess,inmess,outlen,inlen)
char *outmess,*inmess; int *inlen;

This  routine  sends  a  message  to  another  process  and  receives 
a  reply.  The  link  over  which  the  message  is  sent  is  "ulink", 
which should be a REQUEST link. The argument "outmess" points to
the  message  body  to  be  sent,  of  size  "outlen".  (Caution: 
"outlen"  should  include  the  terminating  null,  if  the  message  is  a 
string.)  Similarly,  "inmess"  points  to  where  the  reply  body  will 
be  put.  The  length  of  the  reply  will  be  placed  in  the  integer 
pointed  to  by  "inlen";  if  the  user  doesn't  need  this  feature, 
"inlen"  may  be  set  to  0.  If  "inmess"  is  0,  any  reply  will  be 
discarded.  An  error  is  reported  if  the  reply  does  not  arrive  in 
five  seconds  (see  "recall").  In  normal  cases,  the  return  value 
is  the  link  enclosed  in  the  return  message;  it  is  -1  if  there 
isn't  any  enclosure.  Ignoring  errors,  the  user  may  consider  this 
routine  an  abbreviation  for: 
struct urmesg urmess;

send(ulink,link(0,CHAN16,REPLY),outmess,outlen,NODUP);
receive(CHAN16,inmess,&urmess,5);
if (inlen) *inlen = urmess.urlength;
return(urmess.urlnenc);

Returned values: Under normal circumstances, the return
value is either -1 or a link number. -2 means an error occurred
while sending, -3 means the waiting time expired, -4 means that
the return link was destroyed, -5 means that something was received
with the wrong code, meaning that the user program is also
using CHAN16 for some other purpose, -6 means that a return link
couldn't be created in the first place, -7 means that the ulink
was bad.

NOTE:  CHAN16  is  implicitly  used;  for  this  reason,  the  user  is 
advised  to  avoid  this  channel  entirely.  Several  other  library 
routines also invoke "call", and thus use CHAN16.

4.4  Catch  (Service  Call) 

int catch(chans,data,urmess,catcher) char *data; int catcher();

struct urmesg {     /* for receiving messages */
    int urcode;     /* chosen by user, see "link" */
    int urnote;     /* filled in by Arachne, see "receive" */
    int urchan;     /* chosen by user, see "link" */
    int urlnenc;    /* index of enclosed link */
    int urlength;   /* length of incoming message */
} *urmess;

The  arguments  are  the  same  as  for  the  receive  service  call, 
except  for  the  last  one.  The  procedure  specified  by  "catcher"  is 
activated  as  an  asynchronous  message  recipient  for  messages  that 
appear  on  the  channels  indicated.  If  a  catcher  is  active  on  some 
channel,  then  any  message  that  arrives  on  that  channel  will  cause 
the asynchronous invocation of the catcher, which takes no arguments.
The message itself is placed in "data" and "urmess" in
the same way as for "receive".

The  catcher  procedure  may  inspect  the  message  and  modify 
global  variables;  it  may  not  invoke  any  service  calls  except 
"printf".  If  the  catcher  returns  FALSE,  it  will  be  deactivated 
from the channel across which the message came; if it returns
TRUE, it remains active.

If a catcher has already been activated for some channels,
and a new "catch" call names other channels, then the union of
all the channels active before and now indicated will be activated
for catchers. There is only one catcher procedure, one
"data", and one "urmess" at any time; subsequent "catch" calls
can replace these values with new ones.

If  "catcher"  is  0,  then  instead  of  activating  the  given 
channels,  they  are  deactivated  with  respect  to  catching  messages. 
All  channels  not  mentioned  in  "chans"  are  unaffected.  The  "data" 
and  "urmess"  arguments  are  ignored  in  this  case. 

If  the  destination  of  a  message  is  both  waiting  to  receive 
it  and  has  a  catcher  activated  to  catch  it,  the  message  is  given 
to  the  catcher,  not  the  receive  call.  Catching  a  message 
prevents  it  from  also  being  received. 

Messages  are  caught  in  the  order  in  which  they  arrive  at  the 
destination. 

Returned values: 0 is returned on success. -1 (*) means
the argument "catcher" was bad, -2 (*) means "urmess" or "data"
was bad. In this case, the other specified channels may or may
not get catchers.
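
The activation rules for catchers (union of channels, a single catcher procedure, deactivation when the catcher returns FALSE or when "catcher" is 0) can be sketched as bookkeeping on a channel mask. This is a simulation under stated assumptions, not the kernel's code: the names "catch_sketch" and "deliver_sketch", the CHAN(n) bit encoding, and the two sample catchers are hypothetical, and message buffers are omitted.

```c
#include <assert.h>

#define CHAN(n) (1 << ((n) - 1))    /* assumed encoding: one bit per channel */

typedef int (*catcher_fn)(void);

static int active_chans;            /* channels that currently have a catcher */
static catcher_fn current_catcher;  /* there is only one catcher at any time */
static int caught_count;            /* how many messages the samples caught */

/* Sample catchers: one stays active, one deactivates after one message. */
int keep_catching(void) { caught_count++; return 1; }   /* TRUE  */
int catch_once(void)    { caught_count++; return 0; }   /* FALSE */

int catch_sketch(int chans, catcher_fn catcher)
{
    if (catcher == 0) {             /* deactivate only the named channels */
        active_chans &= ~chans;
        return 0;
    }
    current_catcher = catcher;      /* replaces any earlier catcher */
    active_chans |= chans;          /* union with channels already active */
    return 0;
}

/* Simulate arrival of a message on one channel.
 * Returns 1 if it was caught, 0 if it is left for "receive". */
int deliver_sketch(int chan)
{
    assert(chan != 0 && (chan & (chan - 1)) == 0);
    if (!(active_chans & chan))
        return 0;
    if (!current_catcher())         /* FALSE: deactivate this channel only */
        active_chans &= ~chan;
    return 1;
}
```

Note that a second "catch_sketch" call with a different procedure changes the catcher for every active channel, which is the consequence of having only one catcher procedure at a time.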

4.5  Close  (Library  Routine) 

int  close(file) 

The  argument  "file"  is  either  a  link  to  an  open  file,  or  a 
terminal input or output link. The returned value is 0 on success,
negative on failure (actually, "close" is synonymous with
"destroy"). These links are automatically closed when a process
dies;  however,  execution  of  this  command  gives  the  caller  more 
room  in  its  link  table.  Also,  closing  the  terminal  input  makes 
it  possible  for  another  process  to  open  it. 

4.6  Copy  (Service  Call) 
int copy(link);

This  service  call  returns  a  copy  of  the  given  link.  If  the 
link  is  restricted,  copy  may  fail  or  cause  a  notification  to  be 
sent. Returned values: 0 for success, -1 (*), -2 (*) if the original
link number is out of range or not in use, -3 (*) if the
link  is  protected  against  duplication,  -4  (*)  if  there  is  no  room 
in  the  user's  link  table  for  a  new  link.  (See  "link".) 

4.7 Create (Library Routine)

int create(fslink,fname,mode) char *fname;

If  the  file  named  "fname"  exists,  it  is  opened  for  writing 
and  truncated  to  zero  length.  If  it  doesn't  exist,  it  is  created 
and  opened  for  writing.  The  argument  "fslink"  is  the  caller's 
link  to  the  file  manager.  The  protection  bits  for  the  new  file 
are  specified  by  "mode";  these  bits  have  the  same  meaning  as  for 
UNIX  files,  but  all  files  on  Arachne  have  the  same  owner.  The 
returned values are as in "open".

4.8 Date (Service Call)

long date();

This service call returns the value of the wall clock, which
is a long integer representing the number of seconds since midnight,
Jan 1, 1973, CDT.

4.9  Datetol  (Library  Routine) 
long datetol(s) char s[12];

This  library  routine  converts  a  character  array  with  format 
"yymmddhhmmss"  into  a  long  integer,  representing  the  number  of 
seconds  since  midnight  (00:00:00)  Jan  1,  1973.  It  accepts  dates 
up  to  991231235959  (end  of  1999);  -1  is  returned  on  error. 
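
The conversion described above can be sketched as follows. This is a hypothetical re-implementation written only from the description, not the library's source: the name "datetol_sketch" is invented, time zones are ignored, and rejecting years before 73 is an assumption.

```c
#include <assert.h>
#include <ctype.h>

static int two_digits(const char *p)
{
    return (p[0] - '0') * 10 + (p[1] - '0');
}

/* Convert "yymmddhhmmss" to seconds since 00:00:00 Jan 1, 1973.
 * Returns -1 on a malformed or out-of-range date. */
long datetol_sketch(const char s[12])
{
    static const int mdays[12] =
        { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    int i, y, yy, mo, dd, hh, mi, ss, year, leap;
    long days = 0;

    assert(s != 0);
    for (i = 0; i < 12; i++)
        if (!isdigit((unsigned char)s[i]))
            return -1L;

    yy = two_digits(s);     mo = two_digits(s + 2); dd = two_digits(s + 4);
    hh = two_digits(s + 6); mi = two_digits(s + 8); ss = two_digits(s + 10);

    if (yy < 73 || mo < 1 || mo > 12 || hh > 23 || mi > 59 || ss > 59)
        return -1L;

    year = 1900 + yy;
    leap = (year % 4 == 0);             /* sufficient for 1973 through 1999 */
    if (dd < 1 || dd > mdays[mo - 1] + (mo == 2 && leap))
        return -1L;

    for (y = 1973; y < year; y++)       /* whole years before this one */
        days += 365 + (y % 4 == 0);
    for (i = 0; i < mo - 1; i++)        /* whole months before this one */
        days += mdays[i] + (i == 1 && leap);
    days += dd - 1;

    return ((days * 24 + hh) * 60 + mi) * 60 + ss;
}
```

The largest accepted input, 991231235959, yields a value that still fits comfortably in a 32-bit long.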

4.10  Destroy  (Service  Call) 
int destroy(ulink)

Link  number  "ulink"  is  removed  from  the  caller's  link  table. 
Returned values: 0 is returned on success. -1 (*) means
that the link number is out of range, -2 (*) means that it is an
invalid link, and -3 (*) means the link may not be destroyed
(link 0 has this property).

4.11  Die  (Service  Call) 
die(mesg)  char  *mesg; 

This call terminates the caller. All links held by the
caller are destroyed. As these links are destroyed, DESTROYED
messages are sent along all links that have the TELLDEST restriction;
these messages contain "mesg" as the body (always MSLEN
bytes), unless "mesg" is 0. Error messages are sent along all
links that have the MAYERROR restriction but not TELLDEST.

Various errors can cause a "die" to be automatically generated.
Here are the possible contents of "mesg":

bad die message
bad trap
exception not caught
killed
fell through
core image damaged

4.12  Display  (Service  Call) 
int display(link);

This  call  returns  a  number  in  the  range  0  to  100  that 
represents  the  percentage  of  CPU  time  used  by  the  owner  of  the 
given  link  averaged  over  the  last  4  seconds.  0  means  that  the 
process  has  not  run  at  all;  100  means  that  the  process  has  been 
active  the  entire  time. 

Returned  values:  -1  (*)  if  the  link  points  to  a  different 
machine,  -2  (*)  if  the  link  number  is  invalid. 

4.13 Errhandler (Service Call)

char *errhandler(addr) char *addr;

A  new  exception  handler  is  established  to  catch  exceptions 
raised  during  service  calls  and  receipt  of  error  messages.  The 
handler  is  a  routine  at  location  "addr".  The  old  handler  address 
is  returned.  A  0  value  for  "addr"  disables  exception  catching. 

If not caught, exceptions cause the termination of the offending
process. When an exception arises, the handler will be
invoked with these arguments: the value returned from the failed
service call, the service call number, and all the arguments to
the service call. Return from the handler acts like return from
the service call. To ignore exceptions, use a handler that only
returns its first argument.

Returned values: -1 (*) if "addr" is unreasonable, the address
of the old handler (possibly 0) otherwise.
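
The handler protocol, including the "handler that only returns its first argument", can be sketched with a small simulation. Everything here is an assumption made for illustration: the typed function pointer (the real call takes a "char *" address), the packing of the service call arguments into an array, and the names "errhandler_sketch", "raise_sketch", and "ignore_exceptions".

```c
#include <assert.h>

/* Assumed handler shape: failed call's result, call number, arguments. */
typedef int (*errhandler_fn)(int result, int callno, int *args);

static errhandler_fn current_handler;   /* 0 means exceptions are not caught */
static int terminated;                  /* set when the process would die */

/* Install a new handler and return the old one (possibly 0). */
errhandler_fn errhandler_sketch(errhandler_fn h)
{
    errhandler_fn old = current_handler;
    current_handler = h;
    return old;
}

/* A handler that ignores exceptions: it only returns its first argument. */
int ignore_exceptions(int result, int callno, int *args)
{
    (void)callno;
    (void)args;
    return result;
}

/* What the simulated kernel does when a service call fails with a (*)
 * error: the handler's return value becomes the service call's result. */
int raise_sketch(int result, int callno, int *args)
{
    assert(result < 0);
    if (current_handler == 0) {
        terminated = 1;                 /* uncaught: the process is terminated */
        return result;
    }
    return current_handler(result, callno, args);
}
```

With "ignore_exceptions" installed, a failing service call simply returns its negative error code to the caller, who can then test the result in the ordinary way.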

4.14 Fork (Library Routine)

int  fork(fname,arg,mode)  char  *fname; 

The resource manager starts a new process running the program
found in the file named "fname", which must be in executable
load format. The function named "main" is called with the integer
argument "arg". "Mode" is a combination (logical "or") of
the following flags, defined in "user.h":

one  of  these:  FOREGROUND,  BACKGROUND,  or  DETACHED 

and  one  of  these:  SHARE,  REUSE,  EXCLUSIVE,  or  VIRGIN 

If FOREGROUND is specified, then the new process can be killed by
entering a control-C on the console. FOREGROUND is mainly used
by the command interpreter. If BACKGROUND is specified, then a
"process identifier" is returned that may be used to subsequently
"killoff" the child. DETACHED (i.e., neither FOREGROUND nor
BACKGROUND) is the default. If SHARE is specified, then the
resource manager will be willing to start this new process in the
same code space as another process executing the same file, if
that process was also spawned in SHARE mode. If REUSE is specified,
the code space of an earlier process can be reused. If EXCLUSIVE
is specified, then this process may not be started on a
machine which already has a process using the same executable
file. VIRGIN means that a new copy must be loaded, and is the
default. If the call succeeds, a link of type REQUEST and TELLDEST
is given to the resource manager; the child may obtain this
link by invoking "parline". The caller may receive messages from
the child over this link, which has code 0 and channel CHAN14.

A returned value of -1 indicates an error. Success is indicated
by a return value of 0, except in the case of BACKGROUND
mode,  when  the  return  value  is  a  "process  identifier". 
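
How a "mode" word combining one flag from each group might be composed and decoded is sketched below. The numeric values are hypothetical (the real ones are defined in "user.h" and are not given in this text); the only properties taken from the description are that the flags are combined with a logical "or" and that DETACHED and VIRGIN are the defaults, which is modeled here by giving them the value 0.

```c
#include <assert.h>

/* Hypothetical encodings: two bits per group. */
enum { DETACHED = 0, FOREGROUND = 1, BACKGROUND = 2 };
enum { VIRGIN = 0, SHARE = 4, REUSE = 8, EXCLUSIVE = 12 };

#define RUN_MASK   3
#define IMAGE_MASK 12

/* A mode is well formed if it uses only known bits and does not
 * combine FOREGROUND with BACKGROUND. */
int mode_is_valid(int mode)
{
    if (mode & ~(RUN_MASK | IMAGE_MASK))
        return 0;
    return (mode & RUN_MASK) != (FOREGROUND | BACKGROUND);
}

/* Only BACKGROUND children yield a process identifier from "fork". */
int returns_procid(int mode)
{
    assert(mode_is_valid(mode));
    return (mode & RUN_MASK) == BACKGROUND;
}

/* Only SHARE children may be placed in another process's code space. */
int may_share_image(int mode)
{
    assert(mode_is_valid(mode));
    return (mode & IMAGE_MASK) == SHARE;
}
```

A caller wanting a killable child that may share code would pass BACKGROUND | SHARE; passing 0 selects both defaults.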

4.15  Fsline  (Library  Routine) 
int fsline();

This  routine  returns  the  number  of  a  REQUEST  link  to  be  used 
for communication with the file manager process. An error gives
a  returned  value  of  -1. 

4.16 Handler (Service Call)

handler(vector,func,chan) int (*func)();

The address of a device vector in low core is specified by
"vector". The interrupt vector is initialized so that when an
interrupt occurs, the specified routine "func" is called at interrupt
level. If the interrupt-level routine performs an "awaken"
call, a message will arrive on channel "chan" with urcode 0
and urnote "INTERRUPT" (see "receive").

Returned values: Success returns a value of 0. -1 (*)
means that there have been too many handler calls on that machine
(the limit is currently 2). -2 (*) means that the channel is invalid.
-3 (*) means that the vector address is unreasonable. -4
(*) means that the vector is already in use.

4.17 Inline (Library Routine)

int inline();

This  routine  returns  the  number  of  a  REQUEST  link  to  be  used 
for  subsequent  terminal  input.  The  terminal  driver  only  allows 
one  input  link  to  be  open  at  any  time.  An  error  returns  a  value 
of  -1. 

4.18 Kill (Service Call)
kill(lifeline);

The process indicated by "lifeline" (the return value of a
successful "startup" call) is terminated as if it had performed
"die("killed")". The lifeline is not destroyed.

Returned values: Success returns a value of 0. -1 (*) indicates
that the link is invalid or not a "lifeline".

Only  the  resource  manager  and  terminal  driver  should  use 
this  call. 

4.19 Killoff (Library Routine)
int killoff(procid);

This routine asks the resource manager to kill a process
that the calling process previously created as a BACKGROUND process
with a "fork" request. The value returned from that "fork"
is "procid". The effect on the dead process is as if it had
called "die".

0 is returned for success, -1 for failure.

4.20  Link  (Service  Call) 
int  link(code,chan,restr) 

A  new  link  is  created.  The  caller  becomes  the  new  link's 
owner (forever) and holder (usually not for very long). The
caller  specifies  an  integer,  "code",  which  is  later  useful  to  the 
caller  to  associate  incoming  messages  with  that  link.  The  caller 
also  specifies  "chan"  as  one  of  sixteen  possibilities, 
CHAN1,  ...,  CHAN16,  which  are  integers  containing  exactly  one 
non-zero  bit.  Channels  are  used  to  receive  messages  selectively. 
CHAN16  should  be  avoided,  for  reasons  explained  in  "call". 
CHAN15 should also be avoided, since the kernel uses it for remote
loading. The returned value is the link number that the
caller  should  use  to  refer  to  the  link.  The  argument  "restr"  is 
the  sum  of  various  restriction  bits  that  tell  what  kind  of  link 
it  is.  The  possibilities  are: 

GIVEALL
DUPALL
TELLGIVE
TELLDUP
TELLDEST
REQUEST
REPLY
MAYERROR

"GIVEALL" means that any holder may give the link to someone
else. "DUPALL" means that any holder may duplicate it (i.e.,
give it to someone with "dup" = DUP; see "send"). "TELLGIVE",
"TELLDUP", and/or "TELLDEST" cause the owner to be notified whenever
a holder gives away, duplicates, and/or destroys the link,
respectively (see "receive"). A process may duplicate, give
away, or destroy a newly created link without restriction and
without generating notifications; restrictions and notifications
only apply to links received in messages. A link must be either
of type "REQUEST" or "REPLY". A REPLY link cannot be duplicated
and disappears after one use; a REQUEST link can be used repeatedly
unless it is destroyed by its holder. An enclosed link must
always be of the opposite type from the link over which it is being
sent. If "MAYERROR" is specified, then error messages may be
sent along this link. (See "send" and "receive".)

Returned values: The normal return value is a non-negative
link number. -1 (*) means that the link was specified as either
both or neither of REPLY and REQUEST; -2 (*) means that the channel
is invalid, -3 (*) means there is no room for a new link
(currently 20 links are allowed to each process).
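
The argument checks behind these return values can be sketched as follows. This is an illustrative simulation, not the kernel's implementation: the bit values of the restriction flags, the CHAN(n) encoding, and the names "link_sketch" and "reset_links_sketch" are hypothetical. Only the rules themselves (exactly one of REQUEST and REPLY, a channel with exactly one non-zero bit among sixteen, and a 20-entry link table) come from the text.

```c
#include <assert.h>

#define MAXLINKS 20                 /* links allowed to each process */
#define CHAN(n) (1 << ((n) - 1))    /* assumed encoding of CHAN1..CHAN16 */

/* Hypothetical restriction bits. */
enum {
    GIVEALL = 1, DUPALL = 2, TELLGIVE = 4, TELLDUP = 8,
    TELLDEST = 16, REQUEST = 32, REPLY = 64, MAYERROR = 128
};

struct link_entry { int used, code, chan, restr; };
static struct link_entry link_table[MAXLINKS];

void reset_links_sketch(void)
{
    int i;
    for (i = 0; i < MAXLINKS; i++)
        link_table[i].used = 0;
}

int link_sketch(int code, int chan, int restr)
{
    int i;
    int is_request = (restr & REQUEST) != 0;
    int is_reply = (restr & REPLY) != 0;

    if (is_request == is_reply)             /* both or neither */
        return -1;
    if (chan <= 0 || chan > CHAN(16) || (chan & (chan - 1)) != 0)
        return -2;                          /* not exactly one valid bit */
    for (i = 0; i < MAXLINKS; i++) {
        if (!link_table[i].used) {
            link_table[i].used = 1;
            link_table[i].code = code;
            link_table[i].chan = chan;
            link_table[i].restr = restr;
            assert(i >= 0);
            return i;                       /* the new link number */
        }
    }
    return -3;                              /* link table is full */
}
```

Because each channel is a single bit, several channels can be combined by addition or bitwise "or" when calling "receive" or "catch", which is why a channel argument to "link" with more than one bit set is rejected.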

4.21 Linkok (Service Call)

int linkok(link)

The  returned  value  is  0  if  the  link  number  is  currently 
valid,  -1  if  it  is  out  of  range,  and  -2  if  it  is  in  range  but 
does not denote a valid link.

4.22 Load (Service Call)

int load(prog,fd,plink,arg) char *prog;

This call loads a program. If "fd" is -1, the console
operator is requested to load "prog" manually. If "fd" is a
valid link number (it should be a link to an open file) and
"prog" is -1, the file is loaded on the same machine. In either
of  these  cases,  the  return  value  is  an  "image",  to  be  used  for 
subsequent  "startup"  or  "remove"  calls. 

If "fd" is a link and "prog" is a machine number, the file
is loaded remotely on the corresponding machine and started. The
arguments "plink" and "arg" have the same meaning as in the
"startup" call. The "plink" is automatically given (not duplicated).
The return value is a "lifeline", as for a "startup" call.

Returned  values:  A  nonnegative  image  number  or  lifeline 
number  is  returned  on  success.  -2  (*)  and  -3  (*)  mean  that  the 
link  "fd"  was  out  of  range  or  was  invalid,  respectively.  -5 
means that there wasn't room for the new image. -6 means that
there  are  too  many  images.  -10  (*)  means  that  the  caller  had  no 
room  for  the  lifeline.  -11  (*)  means  that  the  "plink"  was  out  of 
range  or  had  an  invalid  destination. 

Only  the  resource  manager  should  use  this  call. 


4.23 Ltodate (Library Routine)

ltodate(n,s)  long  n;  char  s[30]; 

This  library  routine  converts  a  long  integer,  representing 
the number of seconds since Jan 1, 1973, into a readable character
string telling the time, day of the week, and date. Dates
later  than  1999  are  not  converted  correctly. 

4.24 Nice (Service Call)
nice()

This call allows the Arachne scheduler to run any other
runnable process. (Arachne has a round-robin non-pre-emptive
scheduling discipline; "nice" puts the currently running process
at the bottom.) It is used to avoid busy waits.

4.25  Open  (Library  Routine) 

int  open(fslink,fname,mode)  char  *fname; 

The  file  named  "fname"  is  opened  for  reading  if  "mode"  is  0, 
for writing if "mode" is 1, and for both if "mode" is 2. The argument
"fslink" is the caller's link to the file manager. The
returned  value  is  a  link  number,  used  for  subsequent  "read", 
"write",  and  "close"  operations.  This  link  may  be  given  to  other 
processes,  but  not  duplicated.  -1  is  returned  on  error. 


4.26 Outline (Library Routine)
int outline();

This  routine  returns  the  number  of  a  link  to  be  used  for 
subsequent  terminal  output.  An  error  returns  a  value  of  -1. 

4.27 Parline (Library Routine)
parline();

This  routine  asks  the  resource  manager  for  a  link  to  the 
parent  of  the  caller.  It  assumes  that  the  parent  gave  the 
resource  manager  a  REQUEST  link  when  it  spawned  the  child.  An 
error  returns  a  value  of  -1. 

This  call  is  typically  used  by  a  program  being  run  by  the 
command  interpreter;  the  parent  link  (to  the  command  interpreter) 
is  used  to  get  the  command  line  arguments. 

4.28 Print (Library Routine)

int print(file,format,args...) char *format;

This  routine  implements  a  simplified  version  of  UNIX's 
"printf".  The  argument  "file"  is  either  a  link  to  an  open  file 
or  a  terminal  output  link.  The  input  is  formatted  and  then 
"write"  is  called.  The  "format"  is  a  character  string  to  be 
written, except that two-byte sequences beginning with "%" are
treated specially. "%d", "%o", "%c", "%w", and "%s" stand for
decimal, octal, character, long integer, and string format,
respectively. As these codes are encountered in the format, successive
"args" are written in the indicated manner. (Unlike
"printf", there are no field widths.) A "%" followed by any
character other than the above possibilities disappears, so "%%"
is written out as "%". Only 6 arguments are allowed.
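
The formatting rules above can be sketched with a small formatter that writes into a caller-supplied buffer instead of calling "write". This is a hypothetical illustration in modern C, not the Arachne library source: the name "format_sketch", the buffer-based interface, and the use of the host's "snprintf" for number conversion are all assumptions, and the six-argument limit is not enforced.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Format into "out" (capacity "cap"). Returns the number of characters
 * written, or -1 if the result does not fit. */
int format_sketch(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    size_t n = 0, len;
    char tmp[32];
    const char *piece;

    assert(out != 0 && fmt != 0);
    if (cap == 0)
        return -1;
    va_start(ap, fmt);
    while (*fmt) {
        tmp[0] = *fmt;                  /* default: copy the character */
        tmp[1] = '\0';
        piece = tmp;
        if (*fmt == '%' && fmt[1] != '\0') {
            fmt++;                      /* look at the code after '%' */
            switch (*fmt) {
            case 'd': snprintf(tmp, sizeof tmp, "%d", va_arg(ap, int)); break;
            case 'o': snprintf(tmp, sizeof tmp, "%o", va_arg(ap, unsigned)); break;
            case 'c': tmp[0] = (char)va_arg(ap, int); break;
            case 'w': snprintf(tmp, sizeof tmp, "%ld", va_arg(ap, long)); break;
            case 's': piece = va_arg(ap, const char *); break;
            default:  tmp[0] = *fmt; break;   /* the '%' disappears */
            }
        }
        fmt++;
        len = strlen(piece);
        if (n + len >= cap) {           /* leave room for the terminator */
            va_end(ap);
            return -1;
        }
        memcpy(out + n, piece, len);
        n += len;
    }
    va_end(ap);
    out[n] = '\0';
    return (int)n;
}
```

A call such as format_sketch(buf, sizeof buf, "n=%d s=%s %%", 10, "hi") leaves the text n=10 s=hi % in "buf".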

4.29 Read (Library Routine)

int read(file,buf,size) char *buf;

The  argument  "file"  is  either  a  link  to  an  open  file  or  a 
terminal  input  link.  At  most  "size"  bytes  are  read  into  the 
buffer "buf"; fewer are read if end-of-file occurs. For the terminal,
control-D is interpreted as end-of-file. The returned
value  is  the  number  of  bytes  actually  read. 

4.30 Readline (Library Routine)

int  readline( file, buf , size)  char  *buf; 

This  routine  is  the  same  as  "read",  except  that  it  also 
stops  at  the  end  of  a  line.  For  a  file  a  "newline"  character  is 
interpreted as end-of-line; however, "readline" is very inefficient
for files. For the terminal, a "line-feed" or "carriage
return"  terminates  a  line;  the  last  character  placed  in  the 
buffer  will  be  "newline"  (octal  12).  Control-D  or  control-W  will 
also  terminate  a  line,  but  they  will  not  be  included  in  the  bytes 
read.  The  returned  value  is  the  number  of  bytes  read. 


4.31  Recall  (Library  Routine) 

int recall(inmess,inlen) char *inmess; int *inlen;

If a previous "call" (or "recall") returned a value of -3,
meaning that the message did not arrive in 5 seconds, a process
can invoke the library routine "recall" to continue waiting.
Only the return message buffer and place to store the length are
specified (cf. "call").

Returned  values:  These  are  the  same  as  for  "call",  except 
that  -2  and  -6  don't  apply. 

4.32  Receive  (Service  Call) 

int receive(chans,data,urmess,delay) char *data;

struct urmesg {     /* for receiving messages */
    int urcode;     /* chosen by user, see "link" */
    int urnote;     /* filled in by Arachne, see "receive" */
    int urchan;     /* chosen by user, see "link" */
    int urlnenc;    /* index of enclosed link */
    int urlength;   /* length of incoming message */
} *urmess;

The  caller  waits  until  a  message  arrives  on  one  of  several 
channels,  the  sum  of  which  is  specified  by  "chans".  All  other 
messages  remain  queued  for  later  receipt.  The  code  and  channel 
of  the  link  for  the  incoming  message  are  returned  in  "urcode"  and 
"urchan",  respectively. 

The value of "urnote" is one of six possibilities: DUPPED,
DESTROYED, GIVEN, INTERRUPT, DATA, or ERROR. The first three of
these mean that the link's holder has either duplicated, destroyed,
or given away the link (see "send" and "link"). In the
case of "DESTROYED", the body of the message may contain data
placed there during termination of the sender (see "die" and
"kill"). "INTERRUPT" is discussed under "handler". "DATA" means
that the message was sent by "send".

"ERROR" means either that the message was sent by "send",
but the link had "MAYERROR" and the sender specified "ERROR", or
the link had "MAYERROR" but not "TELLDEST" and the holder terminated
(see "link"). Receipt of an error message raises an exception
(see "errhandler").

The newly assigned link number for the link enclosed with
the message is reported in "urlnenc" (the caller now holds this
link). If no link was enclosed, "urlnenc" is -1. The length of
the  incoming  message  is  reported  in  "urlength".  The  argument 
"data"  must  point  to  a  buffer  of  size  MSLEN  into  which  the  incom¬ 
ing  message,  if  any,  will  be  put.  The  caller  may  discard  the 
message  by  setting  "data"  to  zero.  The  argument  "delay"  gives 
the  time  in  seconds  that  the  caller  is  willing  to  wait  for  a  mes¬ 
sage  on  the  given  channels;  a  "delay"  of  0  means  that  the  call 
will  return  immediately  if  no  message  is  already  there,  and  a 
"delay"  of  -1  means  that  there  is  no  limit  on  how  long  the  caller 
will  wait.  A  process  can  sleep  for  a  certain  amount  of  time  by 
waiting  for  a  message  that  it  knows  won't  come  (e.g.,  on  an 
unused channel).

Returned values: 0 is returned on success. -1 (*) means
the caller has no room for the enclosed link (see "link"; the
message can be successfully received later), -2 (*) means that the
argument "urmess" was bad, and -3 means that the waiting time
expired.

4.33 Remove (Service Call)

remove(image)

The code segment indicated by "image", the return value of a
successful "load" call, is removed. Only the process that
performed a "load" is allowed to subsequently "remove" that image.

Returned values: Success returns a value of 0. -1 (*)
means that the image either doesn't exist or is in use, or that
the caller didn't originally load the image.

The  resource  manager  uses  this  call  to  create  space  for  new 
images;  no  other  program  should  use  this  call. 

4.34  Seek  (Library  Routine) 

int seek(file, offset, mode)

The argument "file" is a link to an open file. The current
position in the file is changed as specified by the "offset" and
"mode". A value for "mode" of 0, 1, or 2 refers to the
beginning, the current position, or the end of the file, respectively.
The "offset" is measured from the position indicated by "mode";
it is unsigned if "mode" = 0, otherwise signed. A returned value
of 0 indicates success, -1 indicates failure.
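
The offset rule above can be illustrated with a small sketch. It assumes that "offset" is a 16-bit word (as an int would be on the PDP-11 family) and that a position before the start of the file counts as a failure; neither point is stated in the text. The helper name "seek_target" is hypothetical and is not part of the Arachne library.

```c
#include <stdint.h>

/* Hypothetical helper (not an Arachne routine): computes the file
 * position that seek(file, offset, mode) would select.
 * Assumption: "offset" is a 16-bit word, read as unsigned when
 * mode is 0 and as two's-complement signed when mode is 1 or 2.
 * Assumption: a resulting position before byte 0 is a failure (-1). */
long seek_target(long current, long end, uint16_t offset, int mode)
{
    long as_unsigned = (long)offset;
    long as_signed = (offset >= 0x8000) ? (long)offset - 0x10000L
                                        : (long)offset;
    long target;

    switch (mode) {
    case 0:  target = as_unsigned;         break; /* from the beginning */
    case 1:  target = current + as_signed; break; /* from the current position */
    case 2:  target = end + as_signed;     break; /* from the end of the file */
    default: return -1;                           /* unknown mode */
    }
    return (target < 0) ? -1 : target;
}
```

For example, the word 0xFFFF selects byte 65535 when "mode" is 0, but moves one byte backward when "mode" is 1 or 2.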

4.35 Send (Service Call)

int send(ulink, elink, data, length, dup) char *data;

This  call  sends  a  message  along  link  number  "ulink".  The 
message  body  is  "data"  and  its  length  is  "length".  If  no  message 
is  to  be  sent,  either  "data"  or  "length"  should  be  zero.  If  the 
caller wishes to pass another link that it holds with the
message, it specifies that link's number in "elink" (the "enclosed
link").  If  there  is  no  enclosure,  "elink"  should  be  -1.  The  use 
of  elinks  is  restricted  in  various  ways;  see  "link". 

The  argument  "dup"  specifies  either  "DUP"  or  "NODUP";  in  the 
first  case,  the  enclosed  link  is  duplicated  so  that  both  the 
sender  and  receiver  will  hold  links  to  the  same  owner;  in  the 
second  case,  the  enclosed  link  is  given  away  so  that  only  the  re¬ 
ceiver  of  the  message  will  hold  it. 

The "dup" argument also may specify "ERROR" (this bit should
be ORed into "DUP" or "NODUP"). If "ulink" has the "MAYERROR"
restriction,  then  an  "ERROR"  message  will  be  sent  to  the  reci¬ 
pient.  If  "MAYERROR"  is  not  set,  then  "ERROR"  has  no  effect. 

Returned  values:  0  is  returned  on  success.  -1  (*)  means 
that  the  ulink  number  is  bad  and  -2  (*)  means  that  the  ulink  is 
invalid.  -3  (*)  and  -4  (*)  have  corresponding  meanings  for  the 
elink.  -5  (*)  means  that  the  message  was  bad,  -6  (*)  means  that 
the  elink  can't  be  duplicated,  -7  (*)  means  that  the  elink  can't 
be  given  away,  and  -8  (*)  means  the  message  is  too  long. 

No error is reported if the destination process has
terminated; in this case, the message is discarded.

4.36  Setdate  (Service  Call) 

setdate(n)  long  n; 

This service call sets the wall clock to "n", which is a
long integer representing the number of seconds since midnight,
Jan 1, 1973.

Only  the  command  interpreter  and  resource  manager  should  use 
this  call. 
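
To make the clock encoding concrete, here is a sketch that converts a calendar date into the number of seconds since midnight, Jan 1, 1973. The function name is hypothetical, and the sketch assumes the clock counts local wall-clock time with no time-zone correction, which the text does not specify.

```c
/* Gregorian leap-year rule. */
static int is_leap(int y)
{
    return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

/* Hypothetical helper: seconds from midnight, Jan 1, 1973, to the
 * given date and time. Assumes year >= 1973 and valid fields. */
long seconds_since_1973(int year, int month, int day, int hour, int minute)
{
    static const int mdays[12] =
        { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    long days = 0;
    int y, m;

    for (y = 1973; y < year; y++)          /* whole years */
        days += is_leap(y) ? 366 : 365;
    for (m = 1; m < month; m++)            /* whole months */
        days += mdays[m - 1] + ((m == 2 && is_leap(year)) ? 1 : 0);
    days += day - 1;                       /* whole days */

    return ((days * 24 + hour) * 60 + minute) * 60;
}
```

The result is the value that would be passed as "n".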


4.37  Startup  (Service  Call) 


int startup(image, arg, plink, dup, fd)

This call starts a process whose code segment is indicated
by "image", the return value of a successful "load" call. The
child is given "arg" as its argument to "main". The child's link
number 0 is "plink", a link owned by the caller; this link is
either given to the child or duplicated depending on whether "dup"
is NODUP or DUP, respectively. The child cannot destroy link 0.
For C programs, the data area is part of the image; for Elmer
programs, "startup" causes a new data area to be created. The
"fd" argument should be a link number for an open file that holds
the Elmer program; it is used to load the data segment.

Returned values: Success returns a non-negative lifeline
number, which can be used for a subsequent "kill". -1 (*) means
that the caller had no room for the lifeline (see "link"). -2
(*) or -3 (*) means that the "plink" was out of range or had an
invalid destination, respectively. -4 means that there was no
room for the new process' stack (or data area: Elmer only). -5
(*) means that the "image" was invalid, -6 (*) means that "image"
is an Elmer program, and "fd" is bad.


Only  the  resource  manager  should  use  this  call. 

4.38 Stat (Library Routine)

int stat(fslink, fname, statbuf) char statbuf[36];

This  library  routine  gives  information  about  the  file  named 
"fname".  The  argument  "fslink"  is  the  caller's  link  to  the  file 
manager.  An  error  returns  a  value  of  -1.  After  a  successful 
call,  the  contents  of  the  36-byte  buffer  "statbuf"  have  the  fol¬ 
lowing  meaning: 

struct {
    char minor;     /* minor device of i-node */
    char major;     /* major device */
    int  inumber;
    int  flags;
    char nlinks;    /* number of links to file */
    char uid;       /* user ID of owner */
    char gid;       /* group ID of owner */
    char size0;     /* high byte of 24-bit size */
    int  size1;     /* low word of 24-bit size */
    int  addr[8];   /* block numbers or device number */
    long actime;    /* time of last access */
    long modtime;   /* time of last modification */
} *buf;

NOTE: Some of these fields are irrelevant, since all Arachne files
have the same owner.
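
The 24-bit file size is split across two fields, so a caller has to recombine them. The sketch below shows one way to do so; the function name is hypothetical, and the arguments are taken as unsigned so that a high byte or low word with its top bit set is not sign-extended.

```c
/* Hypothetical helper: rebuild the 24-bit file size from the
 * "size0" (high byte) and "size1" (low 16-bit word) fields of the
 * stat buffer. Masking guards against sign extension. */
unsigned long stat_file_size(unsigned char size0, unsigned short size1)
{
    return ((unsigned long)(size0 & 0xFFu) << 16) |
           (unsigned long)(size1 & 0xFFFFu);
}
```

The largest size representable this way is 16,777,215 bytes.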


4.39 Time (Service Call)

long time();

This  service  call  returns  a  long  integer  that  may  be  used 
for timing studies. The integer is a measure of time in units
of one ten-thousandth of a second. NOTE: The time wraps around
after a full double word (32 bits).
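
Because the counter wraps after 32 bits, a timing study should subtract readings in unsigned 32-bit arithmetic. The sketch below shows the idea; the helper names are hypothetical, and the result is only meaningful when the measured interval is shorter than one full wrap (2^32 ticks, roughly five days at 10,000 ticks per second).

```c
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two readings of the
 * timer. Unsigned subtraction is performed modulo 2^32, so the
 * result stays correct across a single wraparound. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* One tick is one ten-thousandth of a second. */
double ticks_to_seconds(uint32_t ticks)
{
    return ticks / 10000.0;
}
```

A caller would cast each value returned by "time" to an unsigned 32-bit integer before passing it in.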

4.40 Unlink (Library Routine)

int unlink(fslink, fname) char *fname;

This  library  routine  removes  the  file  named  "fname";  it 
cleans  up  after  "create"  and  "alias".  The  argument  "fslink"  is 
the caller's link to the file manager. An error returns a value of
-1.

4.41 Write (Library Routine)

write(file, buf, size) char *buf;

The  argument  "file"  is  either  a  link  to  an  open  file  or  a 
terminal  output  link.  Using  this  link,  "size"  bytes  are  written 
from  the  buffer  "buf".  There  are  no  return  values. 

5.  CONSOLE  COMMANDS 

The Command Interpreter is a utility process that reads the
teletype. When the Command Interpreter is awaiting a command, it
types a prompt. A command consists of a sequence of "arguments"
separated by spaces. Otherwise, spaces and tabs are ignored
except when included in quotation marks ("). Within
quotes, two consecutive quotes denote one quote; otherwise,
quotation marks are deleted. The first "argument" is interpreted as
a "command" (see below). Command names may be truncated, provided
the result is unambiguous. It is intended that all commands
will differ in their first three characters.
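
The truncation rule can be sketched as a prefix search that accepts an abbreviation only when exactly one command name begins with it. This is an illustration, not the Command Interpreter's actual code: the function name and return conventions are hypothetical, and the table simply repeats the command names documented later in this section.

```c
#include <string.h>

/* Command names as listed in this section. */
static const char *commands[] = {
    "alias", "background", "copy", "delete", "dump", "help", "kill",
    "make", "rename", "run", "set", "time", "type"
};

/* Hypothetical helper: returns the index of the single command
 * that begins with "word", -1 if no command matches, or -2 if the
 * abbreviation is ambiguous. */
int match_command(const char *word)
{
    size_t len = strlen(word);
    int count = (int)(sizeof commands / sizeof commands[0]);
    int i, found = -1;

    if (len == 0)
        return -1;
    for (i = 0; i < count; i++) {
        if (strncmp(commands[i], word, len) == 0) {
            if (found >= 0)
                return -2;      /* second match: ambiguous */
            found = i;
        }
    }
    return found;
}
```

With this table, "del" selects "delete", while "d" is rejected because it could mean either "delete" or "dump".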

The "run" command may be followed by from one to MAXCOMS (4)
commands separated by the symbol "*". The terminal output of the
command to the left of a "*" is buffered by a special "pipe"
process, and fed as though it were terminal input to the command to
the right of the "*". The output from the last process in a pipe
may be redirected to a file by following it with "* to
outfilename". The input to the first process in a pipe may be
obtained from a file by preceding it with "from infilename *".
Although "to" and "from" appear to be the names of processes in
the pipe, they do not count towards the MAXCOMS maximum.
Furthermore, "to" and "from" are reserved words to the command
interpreter, and hence neither may be the name of a user program.

The  following  is  an  alphabetized  list  of  console  commands. 

5.1 alias <filename1> <filename2>

The second indicated file becomes another name for the first
indicated file. If either of these is "deleted", the other
(logical) copy still exists; however, changes to either affect both.

5.2 background <filename> <arg>

The  indicated  file  must  be  executable.  It  is  started  as  a 
BACKGROUND  process,  with  the  integer  argument  "arg".  The  Command 
Interpreter prints out the new process's process identifier,
which may be used for subsequent "killing", and then gives the
next prompt.

5.3 copy <filename1> <filename2>

The second indicated file is created with a copy of the
contents of the first indicated file.

5.4 delete <filename>

The  indicated  file  is  deleted. 

5.5 dump <address>

Prints  a  screenful  of  memory  locations  in  octal  for  debug¬ 
ging. 

5.6  help 

A  list  of  available  commands  is  displayed. 

5.7 kill <arg>

The  indicated  argument  should  be  the  process  identifier  re¬ 
turned  from  a  previous  "background"  command.  The  process  re¬ 
ferred  to  by  the  process  identifier  is  killed. 

5.8 make <filename>

The  named  file  is  created.  Subsequent  input  is  inserted 
into  the  file;  the  input  is  terminated  by  a  control-D. 


5.9  rename  <oldname>  <newname> 


The name of file "oldname" is changed to "newname".

5.10 run <filename> { <arg> } { * <filename> { <arg> } }

The  indicated  files  should  be  executable  files.  The  right¬ 
most  one  is  run  as  a  FOREGROUND  process.  The  others  are  run  as 
BACKGROUND  processes.  The  Resource  Manager  is  given  a  REQUEST 
link,  which  the  new  process  may  use  to  ask  for  the  command  line 
arguments. When the loaded program starts up, the argument to
"main" tells the number of command line arguments. To get the
individual  arguments,  the  loaded  program  sends  a  message  to  the 
Command  Interpreter  (its  parent).  The  first  word  of  the  message 
is  ARGREQ,  and  the  second  is  an  integer  specifying  which  argument 
is  desired.  The  name  of  the  program  is  argument  number  0.  The 
returned  message  body  is  the  argument,  which  is  a  null-terminated 
string  of  length  at  most  MSLEN. 

5.11 set <modelist> or SET <modelist>

This command changes the console input modes. The mode list
is a sequence of keywords "x" or "-x", where "x" can be any of
the following:


    upper   (the terminal is upper case)
    echo    (the terminal echoes input)
    hard    (the terminal is hard-copy)
    tabs    (the terminal has hardware tabs)


Keywords may be abbreviated according to the same rules as
commands. The format "x" turns on the corresponding mode, "-x"
turns it off. (UPPER is recognized for upper; "lower" means
"-upper".) For more information, see the section "TERMINAL INPUT
PROTOCOLS".


5.12  time  <format> 


If a format is given (as "yymmddhhmm"), the wall clock is
set  to  that  time  and  printed.  With  no  argument,  "time"  prints 
the  wall  clock  time. 


5.13 type <filename>


The  indicated  file  is  typed. 


6.  TERMINAL  INPUT  PROTOCOLS 


The terminal driver performs interrupt-driven I/O, which
allows for typing ahead. Also, the following characters have
special meanings:

    Control-C        kill the running program (but don't kill the
                     command interpreter itself)
    Control-D        end of file (terminates a "read" or "readline")
    Control-K        end of line (but no character sent)
    line-feed        end of line
    carriage return  end of line
    rubout           erase last character (unless line empty)
    Control-X        erase current line
    Control-S        enter scroll mode; pause every 13 lines of
                     output; if paused, allow the next 18 lines to be
                     printed.
    Control-Q        leave scroll mode; if paused, allow output to
                     continue.
    escape           next character should be sent as is

In "echo" mode, input is echoed, otherwise not. In "hard"
mode, output is designed to be legible on hardcopy devices;
otherwise the terminal driver assumes that the cursor can move
backward, as on a CRT. In "tabs" mode, advantage is taken of
hardware tabs on the terminal. In "upper" mode, the terminal is
assumed to have only upper case. Input is converted to lower
case, unless escaped. Upper case characters are printed and
echoed with a preceding "!". Escaped [, ], @, ^, and \ are
converted to {, }, `, ~, and |, respectively, and the latter are
similarly indicated by preceding "!"s.
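
The "upper" mode input rules can be sketched as a single translation function. This sketch assumes that the five escaped characters are [, ], @, ^, and \, each of which maps to the ASCII character 0x20 above it ({, }, `, ~, and |); the function name is hypothetical and the real terminal driver may be organized differently.

```c
/* Hypothetical helper: translate one input character in "upper"
 * mode. "escaped" is nonzero when the character was preceded by
 * the escape character. */
int upper_mode_input(int c, int escaped)
{
    if (!escaped) {
        /* Unescaped input is converted to lower case. */
        return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
    }
    switch (c) {
    case '[':  return '{';
    case ']':  return '}';
    case '@':  return '`';
    case '^':  return '~';
    case '\\': return '|';
    default:   return c;    /* e.g. an escaped letter stays upper case */
    }
}
```

Under these assumptions, typing A yields "a", while an escaped A stays "A" and an escaped [ yields "{".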

7.  UTILITY  PROCESS  PROTOCOLS 

This section describes the protocols that user programs must
follow to communicate with the utility processes when the library
routines described earlier are inadequate. The four utility
processes are the resource manager, the file manager, the
terminal driver, and the command interpreter. The resource manager
keeps track of which programs are loaded and/or running on the
local machine. The kernel and the resource manager reside on
each machine. The terminal driver governs I/O on the console;
the command interpreter interprets console input. The file
manager implements a file system by communicating with the
PDP-11/40. It need not exist on every machine.

During  Arachne  initialization,  one  resource  manager  is 
started.  It  loads  a  full  complement  of  utility  processes  (the 
terminal  driver,  command  interpreter,  and  file  manager)  on  its 
machine  and  various  utility  processes  on  the  other  machines. 
When  a  particular  resource  manager  is  not  given  a  local  terminal 
driver  or  file  manager,  it  shares  the  one  on  the  initial  machine. 

7.1 Input/Output Protocols

This  section  describes  the  message  formats  used  for  communi¬ 
cating  with  the  file  manager  and  terminal  driver  processes.  A 
program  that  explicitly  communicates  with  the  file  manager  or 
terminal driver must include the header files "filesys.h" and
"ttdriver.h", which define the necessary structures.

To  open  an  input  or  output  line  to  the  terminal,  to  change 
the  modes  on  the  terminal,  or  to  inform  the  teletype  of  whom  it 
should  kill  when  encountering  a  control-C,  a  message  is  sent  over 
the  terminal  link  of  the  following  form: 

struct ttinline{
    char tticom;
    char ttisubcom;
    char ttimodes;
}

"tticom" is either OPEN, STTY, MODES, or TOKILL. In the case of
OPEN, "ttisubcom" is either READ or WRITE, and the return message
has the new link enclosed. In the case of STTY, "ttimodes" tells
what the new modes should be (a bit-wise sum of ECHO, TABS, HARD,
and UPPER). In the case of MODES (to find out the current
modes), the return message has the modes in "ttimodes". In the
case of TOKILL (to inform the terminal driver which process to
kill on receipt of control-C), the message encloses a lifeline.

To  open,  create,  unlink,  alias,  or  get  status  information  on 
a  file,  a  message  is  sent  over  the  file  manager  link  in  the  fol¬ 
lowing  form: 

struct ocmesg{
    int ocaction;
    int ocmode;
    char fsname[MSLEN-4];
}

"ocaction" is either OPEN, CREATE, UNLINK, ALIAS, or STAT.
"ocmode" is the mode for OPEN or CREATE; in the case of ALIAS, it
holds the length of the first file name. "ocname" contains the
file name (or, in the case of ALIAS, the concatenation of two
file names), null-terminated. In the cases of OPEN or CREATE, a
successful return contains a valid enclosed link; for UNLINK,
STAT, or ALIAS, there is no enclosed link. In the case of STAT,
the return message has the structure of a "rdmesg" as in the case
of READ below; the response has length 36 or 0, corresponding to
success or failure, respectively. In all other cases, the
response is one word: 0 on success, -1 on failure.
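
The ALIAS layout is the least obvious of these requests, so here is a sketch of filling one in. Several details are assumptions: the value of MSLEN and the ALIAS request code are placeholders (the real ones come from "filesys.h"), the two names are assumed to be concatenated with no separator, and 16-bit fields are written as "short" so the layout matches a machine with 16-bit ints. The name field follows the structure declaration ("fsname"), and the function name is hypothetical.

```c
#include <string.h>

#define MSLEN 40    /* placeholder; the real value is system-defined */
#define ALIAS 4     /* placeholder request code */

struct ocmesg {
    short ocaction;
    short ocmode;
    char  fsname[MSLEN - 4];
};

/* Hypothetical helper: fill in an ALIAS request. Returns 0 on
 * success, or -1 if the two names and the terminating null do not
 * fit in the name field. */
int make_alias_request(struct ocmesg *m, const char *first,
                       const char *second)
{
    size_t n1 = strlen(first);
    size_t n2 = strlen(second);

    if (n1 + n2 + 1 > sizeof m->fsname)
        return -1;
    m->ocaction = ALIAS;
    m->ocmode = (short)n1;                   /* length of the first name */
    memcpy(m->fsname, first, n1);
    memcpy(m->fsname + n1, second, n2 + 1);  /* includes the final null */
    return 0;
}
```

The receiver can split the two names again by using "ocmode" as the length of the first one.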

For  either  the  terminal  or  the  file  manager,  reading  or 
writing  is  done  by  sending  a  message  of  the  following  form: 

struct fsmesg{
    int fsaction;
    int fslength;
}

"fsaction" should be either READ, READLINE, or WRITE. "fslength"
tells how many bytes are intended to be read, or are being sent
to be written. In the case of WRITE, the text is sent in
subsequent messages, and nothing is returned. In the cases of READ or
READLINE, the response is of the following form:

struct rdmesg{
    char rdtext[MSLEN];
}

The  maximum  allowable  read  is  size  MSLEN.  The  actual  size  of  the 
returned  message  is  contained  in  the  "urlength"  field. 

To perform a seek on an open file, send a message to the
file manager of the following form:

struct skmesg{
    int skaction;   /* should be SEEK */
    int skoffset;
    int skmode;
}

The  return  message  is  one  word:  0  for  success,  -1  for  failure. 

7.2 Resource Manager Protocols

Processes that communicate explicitly with the resource
manager must include the header file "resource.h". The following
structure is declared there:

struct rmmesg {     /* messages to resource managers */
    int rmreq;      /* type of request */
    int rmarg;      /* various miscellaneous arguments */
    int rmmode;     /* the mode for STARTs or KILLs */
}

The  resource  manager  keeps  track  of  which  images  (code  seg¬ 
ments)  and  processes  exist.  A  separate  resource  manager  runs  on 
each  machine  in  the  network;  these  programs  communicate  with  each 
other,  but  are  relatively  independent. 

Each  resource  manager  holds  a  terminal  link  and  file  manager 
link,  which  are  either  for  local  utility  processes  or  else  links 
received from the first resource manager initialized. Whenever a
resource manager has a local terminal it also has a local command
interpreter.

There  are  three  kinds  of  processes:  FOREGROUND,  BACK¬ 
GROUND,  and  DETACHED.  When  a  process  is  started,  its  link  0  is 
owned  by  the  local  resource  manager,  to  whom  all  of  this 
process's  requests  are  directed. 

The  first  FOREGROUND  process  for  any  terminal  is  always  the 
command  interpreter,  which  initially  "has  the  ball".  Each  termi¬ 
nal  always  has  one  FOREGROUND  process  that  "has  the  ball".  The 
process  "with  the  ball"  may  create  another  FOREGROUND  process, 
which means that the child now "has the ball". The meaning of
"having the ball" is that a control-C entered on the
corresponding terminal will terminate the process. When the process "with
the  ball"  terminates,  its  parent  then  "recovers  the  ball",  and 
will  be  terminated  by  the  next  control-C.  If  one  of  the 
processes  in  this  FOREGROUND  chain  terminates,  the  chain  is  re¬ 
linked  appropriately.  The  command  interpreter  is  an  exception  in 
that  control-C's  have  no  effect  on  it. 
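
The "ball" rules can be modeled as a chain of process identifiers whose last entry holds the ball. The sketch below is a simulation for illustration only, not the resource manager's implementation; the structure, function names, and the MAXCHAIN limit are all hypothetical.

```c
#define MAXCHAIN 16     /* hypothetical limit on chain length */

struct fgchain {
    int pid[MAXCHAIN];  /* pid[0] is the command interpreter */
    int depth;          /* number of processes in the chain */
};

void chain_init(struct fgchain *c, int interpreter)
{
    c->pid[0] = interpreter;
    c->depth = 1;
}

/* The last process in the chain "has the ball". */
int ball_holder(const struct fgchain *c)
{
    return c->pid[c->depth - 1];
}

/* Only the process with the ball may start a FOREGROUND child.
 * Returns 0 on success, -1 otherwise. */
int start_foreground(struct fgchain *c, int parent, int child)
{
    if (ball_holder(c) != parent || c->depth == MAXCHAIN)
        return -1;
    c->pid[c->depth++] = child;
    return 0;
}

/* A process anywhere in the chain terminates: relink around it.
 * The command interpreter (index 0) is never removed. */
void process_terminated(struct fgchain *c, int pid)
{
    int i, kept = 1;

    for (i = 1; i < c->depth; i++)
        if (c->pid[i] != pid)
            c->pid[kept++] = c->pid[i];
    c->depth = kept;
}

/* Control-C kills the ball holder, except the command interpreter.
 * Returns the pid killed, or -1 if nothing was killed. */
int control_c(struct fgchain *c)
{
    if (c->depth == 1)
        return -1;
    return c->pid[--c->depth];
}
```

After a control-C, the parent of the killed process is again the last entry, so it has "recovered the ball".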

A  process  may  also  create  another  process  as  a  BACKGROUND 
process.  In  this  case,  the  child's  process  identifier  is  re¬ 
turned  to  the  parent,  and  later  the  parent  can  use  this  identif¬ 
ier  to  terminate  the  child.  These  identifiers  are  assigned  by 
the  resource  manager,  and  are  distinct  from  the  process  identif¬ 
iers  used  in  the  kernel. 

A DETACHED process cannot be terminated by either method.

A user may make five kinds of requests on its resource
manager:

1. RMTTREQ Request

The resource manager is requested to give the requestor a
link to the requestor's terminal. This link will be sent over
the enclosed link in the request, which should therefore be a
REPLY link.

2.  RMFSREQ  Request 

The  resource  manager  duplicates  its  file  manager  link  and 
sends  it  back  over  the  enclosed  link  in  the  request,  which  should 
therefore  be  a  REPLY  link. 

3. RMSTART Request

The resource manager will start a process, using the link
enclosed with this request for two purposes: 1) to respond to
the request (see conditions for response below), or 2) to save it
and give to the child if the child asks for it (see RMPLINK
below). The caller must be careful, of course, not to give a
REPLY link if both uses are intended. Also, the caller must make
the enclosed link GIVEALL if the resource manager should try to
load the process on another machine, rather than giving up if it
doesn't fit on the local one. The RMSTART request also specifies
the file name and an integer argument to be given to the child
when it starts.

The caller also specifies a "mode" for starting the child,
which is a combination of bits with various meanings. The user
should specify either BACKGROUND, FOREGROUND, or DETACHED (the
default is DETACHED). FOREGROUND is only allowed if the
requestor currently "has the ball" for its terminal. The user may
specify EXCLUSIVE, which causes the resource manager to load it
on a machine only if there is no like-named core image with its
EXCLUSIVE bit set on that machine. The user should specify
either SHARE, REUSE, EXCLUSIVE, or VIRGIN (the default is VIRGIN).
These alternatives are described above (see "fork"). The user
should also specify either GENTLY or ROUGHLY (the default is
GENTLY). If GENTLY, the resource manager will first try to load
it locally without throwing out any other unused images and then
will try to do the same on other machines. When this fails, or
if ROUGHLY was specified, it tries to make room locally for the
new process, and then tries to do so on other machines. The user
should also specify either ANSWER or NOANSWER (the default is
NOANSWER). If ANSWER is specified, or if BACKGROUND was
specified, then the resource manager sends a reply over the enclosed
link. The first word of the reply is the return code; -1 always
means failure; 0 means success except in the case of BACKGROUND,
when the value returned is the process identifier of the child.
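
The reply rules for RMSTART can be summarized in a short sketch. The bit assignments below are hypothetical (the real ones are defined in the resource manager's header file), only the bits needed for the reply logic are shown, and the function names are illustrative.

```c
/* Hypothetical bit assignments for the RMSTART "mode". */
enum {
    KIND_MASK  = 0x03,
    DETACHED   = 0x00,  /* the default kind */
    BACKGROUND = 0x01,
    FOREGROUND = 0x02,
    ANSWER     = 0x04   /* bit clear means NOANSWER, the default */
};

/* A reply is sent if ANSWER or BACKGROUND was specified. */
int reply_expected(int mode)
{
    return (mode & ANSWER) != 0 || (mode & KIND_MASK) == BACKGROUND;
}

/* Interpret the first word of the reply: -1 always means failure;
 * for BACKGROUND the word is the child's process identifier;
 * otherwise 0 means success. Returns -1, the identifier, or 0. */
int start_result(int mode, int first_word)
{
    if (first_word == -1)
        return -1;
    if ((mode & KIND_MASK) == BACKGROUND)
        return first_word;
    return 0;
}
```

A caller that starts a DETACHED child without ANSWER should therefore not wait for a reply at all.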

An  existing  code  segment  is  reusable  if  the  filename  still 
refers  to  an  existing  publicly  executable  load  format  file  that 
has  not  been  modified  since  the  copy  in  question  was  loaded.  Any 
number  of  processes  may  share  a  code  segment.  The  terminal  asso¬ 
ciated  with  a  child  process  is  always  the  same  as  the  one  associ¬ 
ated  with  its  parent;  the  command  interpreter  is  loaded  with  a 
terminal  during  initialization. 

4. RMKILL Request

The  resource  manager  kills  the  process  whose  process  iden¬ 
tifier  is  given  as  part  of  the  request.  The  request  may  enclose 
a  link  that  is  used  to  give  a  one-word  acknowledgement  of  success 
or  failure  if  the  request  specifies  ANSWER  (as  in  RMSTART, 
described above). The process being killed must of course be
BACKGROUND,  and  only  the  process  that  started  it  is  allowed  to 
kill  it. 


5.  RMPLINK  Request 


The  resource  manager  returns  the  link  that  was  originally  en¬ 
closed  with  the  request  that  started  this  process.  It  is  re¬ 
turned  over  the  link  enclosed  with  the  RMPLINK  request,  which 
must  therefore  be  of  the  proper  type,  whichever  that  may  be. 